
Building a private knowledge base for a 40,000-document regulatory library

A national-tier financial regulator (engagement 2024) held a library of roughly 40,000 documents (legislation, guidance notes, enforcement decisions, and consultation papers) spanning 20 years and multiple jurisdictions. Analysts were spending 3+ hours per query just to find the relevant passages. We built hybrid retrieval (dense semantic search via a locally hosted multilingual-E5 model, plus sparse BM25 for regulatory citation accuracy) combined with structure-preserving chunking that keeps section, sub-section, and cross-reference links intact. A locally hosted LLM synthesises the retrieved passages with source citations. The system reached 94% query accuracy on a 500-question benchmark scored by senior analysts, and average research time fell from 3 hours to 8 minutes. Every answer is traceable to a specific document, section, and page. All infrastructure runs inside the regulator's perimeter.

Query accuracy: 94%
Average research time: 3 hr → 8 min
Documents leaving the perimeter: 0

The Challenge

The regulatory library contained about 40,000 documents (legislation, guidance notes, enforcement decisions, and consultation papers) spanning 20 years and multiple jurisdictions. Keyword search produced too many results. Analysts were spending 3+ hours per regulatory research query just to find the relevant passages.

Our Approach

We built a RAG (Retrieval-Augmented Generation) system on top of the document library. The architecture had three key components: (1) a document chunking and embedding pipeline that preserves regulatory structure (section, sub-section, and cross-reference links); (2) a hybrid retrieval layer combining dense semantic search with sparse BM25 for regulatory citation accuracy; and (3) a locally hosted LLM that synthesises retrieved passages with source citations. All components run on the organisation's internal infrastructure; no data leaves the perimeter.
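To make the hybrid retrieval layer concrete, here is a minimal sketch of one way to merge a dense ranking and a BM25 ranking into a single result list. The case study does not say how the two result sets were combined; reciprocal rank fusion is used here only as a common, simple option, and the function name, the constant k=60, and the document IDs are all illustrative assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a conventional default, not a value
    taken from the case study.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for a single query (IDs are made up for illustration).
dense_hits = ["guidance-12", "decision-88", "act-3"]   # semantic ranking
sparse_hits = ["act-3", "guidance-12", "paper-41"]     # BM25 ranking

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

Documents that both retrievers rank highly rise to the top, while a document found by only one retriever still survives into the fused list, which is the property that matters when an exact citation match and a semantic match disagree.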

Outcome

94% query accuracy on a benchmark of 500 representative research questions assessed by senior analysts. Average research time reduced from 3 hours to 8 minutes. The system handles 1,200+ queries per day. The citation layer means every answer is traceable to a specific document, section, and page.
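The traceability claim can be illustrated with a small sketch of a citation layer: retrieved passages carry their document, section, and page, the prompt numbers them, and every [n] marker in the generated answer is resolved back to a passage or rejected. This is an assumed design, not the delivered implementation; the Passage type, the prompt wording, and the sample passages are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    document: str
    section: str
    page: int
    text: str


def build_prompt(question, passages):
    """Number each passage so the model can cite it as [1], [2], ..."""
    sources = "\n".join(
        f"[{i}] {p.document}, s. {p.section}, p. {p.page}: {p.text}"
        for i, p in enumerate(passages, start=1)
    )
    return (
        "Answer using only the sources below and cite each claim as [n].\n\n"
        f"{sources}\n\nQuestion: {question}"
    )


def resolve_citations(answer, passages):
    """Map [n] markers in an answer back to passages, in order of first use.

    Raises ValueError if the answer cites a source that was never retrieved.
    """
    cited = []
    for marker in re.findall(r"\[(\d+)\]", answer):
        index = int(marker)
        if not 1 <= index <= len(passages):
            raise ValueError(f"citation [{index}] has no matching source")
        if passages[index - 1] not in cited:
            cited.append(passages[index - 1])
    return cited


# Hypothetical passages and answer, invented purely for illustration.
passages = [
    Passage("Guidance Note A", "4.2", 12, "Firms must notify the regulator of breaches."),
    Passage("Enforcement Decision B", "7", 3, "Late notification attracted a penalty."),
]
answer = "Breaches must be notified [1]; late notice has been penalised [2][1]."
for p in resolve_citations(answer, passages):
    print(f"{p.document}, section {p.section}, page {p.page}")
```

Rejecting unknown markers, rather than silently dropping them, keeps an answer from appearing better sourced than it is.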

What We Learned

01

Hybrid retrieval (dense + sparse) outperforms either alone for regulatory document search.

02

Preserving document structure during chunking is as important as the embedding model.

03

Citations are not optional for knowledge workers; they're the whole point.
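As an illustration of the second lesson, the sketch below splits a document at its numbered section headings instead of at fixed character counts, and keeps each chunk's section number, heading path, and outgoing cross-references as metadata. It assumes headings look like "1 Scope" or "1.1 Definitions" and that cross-references are written "section 2.1"; real regulatory texts vary, so both patterns and all names here are assumptions. A body line that happens to start with a number would be misread as a heading under this simple rule.

```python
import re
from dataclasses import dataclass, field

# Assumed heading format: a dotted section number, a space, then a title.
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\s+(\S.*)$")
# Assumed cross-reference format: the word "section" followed by a number.
CROSS_REF = re.compile(r"[Ss]ection\s+(\d+(?:\.\d+)*)")


@dataclass
class Chunk:
    section: str                                   # e.g. "1.1"
    path: list                                     # titles from the top-level section down
    text: str = ""
    cross_refs: list = field(default_factory=list)


def chunk_by_section(document_text):
    """Return one chunk per numbered section, with structure kept as metadata."""
    chunks, titles, current = [], {}, None
    for raw_line in document_text.splitlines():
        line = raw_line.strip()
        match = HEADING.match(line)
        if match:
            number, title = match.groups()
            titles[number] = title
            parts = number.split(".")
            # "1.1" has ancestors "1" and "1.1"; collect the titles seen so far.
            ancestors = [".".join(parts[:i]) for i in range(1, len(parts) + 1)]
            current = Chunk(section=number, path=[titles[a] for a in ancestors if a in titles])
            chunks.append(current)
        elif current is not None and line:
            current.text = f"{current.text} {line}".strip()
    for chunk in chunks:
        chunk.cross_refs = sorted(set(CROSS_REF.findall(chunk.text)))
    return chunks


sample = """1 Scope
This guidance applies to authorised firms.
1.1 Definitions
Terms have the meaning given in section 2.
2 Reporting
Firms must report as set out in section 1.1.
"""

for chunk in chunk_by_section(sample):
    print(chunk.section, " > ".join(chunk.path), chunk.cross_refs)
```

Storing the heading path alongside the text lets a retrieved passage be shown in context, and the cross-reference list gives the retriever a way to pull in the sections a passage depends on.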

Stages Engaged

Discovery & Blueprint, Concept Validation, Production Build

Total Duration

4 months

Artifacts Delivered

PRD, RAG Architecture Blueprint, Document Processing Pipeline Spec, WBS, IT Runbook