In-House & Private AI | RAG & Knowledge Base | CASE_09

Building a private knowledge base for a 40,000-document regulatory library

A national-tier financial regulator (engagement 2024) held a library of roughly 40,000 documents (legislation, guidance notes, enforcement decisions, and consultation papers) spanning 20 years and multiple jurisdictions. Analysts were spending 3+ hours per query just to find the relevant passages. We combined hybrid retrieval (dense semantic search via a locally-hosted multilingual-E5 model, plus sparse BM25 for regulatory citation accuracy) with structure-preserving chunking that keeps section, sub-section, and cross-reference links intact. A locally-hosted LLM synthesises the retrieved passages with source citations. The result: 94% query accuracy on a 500-question benchmark scored by senior analysts, and average research time cut from 3 hours to 8 minutes. Every answer is traceable to a specific document, section, and page. All infrastructure runs inside the regulator's perimeter.

94% query accuracy

3 hr → 8 min average research time

No documents leave the perimeter

The Challenge

The regulatory library contained 40,000 documents (legislation, guidance notes, enforcement decisions, and consultation papers) spanning 20 years and multiple jurisdictions. Keyword search returned too many results to review efficiently, so analysts were spending 3+ hours per regulatory research query just to find the relevant passages.

Our Approach

We built a RAG (Retrieval-Augmented Generation) system on top of the document library. The architecture had three key components: (1) a document chunking and embedding pipeline that preserves regulatory structure (section, sub-section, and cross-reference links); (2) a hybrid retrieval layer combining dense semantic search with sparse BM25 for regulatory citation accuracy; and (3) a locally-hosted LLM that synthesises retrieved passages with source citations. All components run on the organisation's internal infrastructure, so no data leaves the perimeter.
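To illustrate how a hybrid retrieval layer can merge its dense and sparse result lists, here is a minimal sketch using reciprocal rank fusion (RRF). The fusion method used in this engagement is not stated above, so RRF, the `k=60` constant, the function name, and the chunk IDs are all assumptions chosen for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into one ranking.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a conventional default, not a value
    taken from the engagement.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical top results from each retriever for a single query.
dense_hits = ["guidance-12#s3", "act-2009#s41", "decision-88#p2"]
sparse_hits = ["act-2009#s41", "consult-5#s1", "guidance-12#s3"]

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

In this toy example, chunks that both retrievers return rise above chunks that only one of them finds. That is the behaviour a hybrid layer is after when exact citations (sparse) and paraphrased questions (dense) need to agree.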

Outcome

94% query accuracy on a benchmark of 500 representative research questions assessed by senior analysts. Average research time reduced from 3 hours to 8 minutes. The system handles 1,200+ queries per day. The citation layer means every answer is traceable to a specific document, section, and page.
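For readers who want the headline figures as plain arithmetic, the short sketch below works them through. It assumes that "accuracy" means the share of benchmark questions the analysts judged correct; the variable names are ours.

```python
benchmark_questions = 500
accuracy = 0.94

# Assumption: accuracy = questions judged correct / total questions.
correct_answers = round(benchmark_questions * accuracy)

before_minutes = 3 * 60   # 3 hours per query before the system
after_minutes = 8         # 8 minutes per query after

speedup = before_minutes / after_minutes
reduction_pct = round((1 - after_minutes / before_minutes) * 100, 1)

print(correct_answers, speedup, reduction_pct)
```

Under that assumption, 94% corresponds to 470 of the 500 questions, and the research time drops by a factor of 22.5, or about 95.6%.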

What We Learned

Hybrid retrieval (dense + sparse) outperforms either alone for regulatory document search.

Preserving document structure during chunking is as important as the embedding model.

Citations are not optional for knowledge workers; they're the whole point.
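The last two lessons are connected: a chunk that still carries its section number, page, and cross-references is what makes a precise citation possible. The sketch below shows one way this could look. It is not the pipeline delivered in this engagement. The heading and cross-reference patterns, the `Chunk` fields, the citation format, and the sample text are all hypothetical.

```python
import re
from dataclasses import dataclass, field

# Assumed conventions for this sketch: headings look like "4.2 Title" and
# cross-references look like "section 4.2". Real documents will vary.
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\s+(.+)$")
CROSS_REF = re.compile(r"[Ss]ection\s+(\d+(?:\.\d+)*)")


@dataclass
class Chunk:
    doc_id: str
    section: str   # e.g. "4.2"
    title: str
    page: int      # page on which the section starts
    text: str = ""
    cross_refs: list = field(default_factory=list)


def chunk_by_section(doc_id, pages):
    """Split a document into one chunk per numbered section.

    `pages` is a list of page texts; page numbers start at 1.
    """
    chunks = []
    current = None
    for page_no, page_text in enumerate(pages, start=1):
        for line in page_text.splitlines():
            line = line.strip()
            if not line:
                continue
            heading = HEADING.match(line)
            if heading:
                # A new heading closes the previous chunk and opens a new one.
                current = Chunk(doc_id, heading.group(1), heading.group(2), page_no)
                chunks.append(current)
            elif current is not None:
                current.text = f"{current.text} {line}".strip()
                current.cross_refs += CROSS_REF.findall(line)
    return chunks


def format_citation(chunk):
    """Render the document, section, and page a passage came from."""
    return f"{chunk.doc_id}, s. {chunk.section} ({chunk.title}), p. {chunk.page}"


# Hypothetical two-page guidance note.
pages = [
    "4 Reporting\n4.1 Deadlines\nFirms must report within 30 days.",
    "4.2 Exemptions\nSmall firms are exempt, subject to section 4.1.",
]
chunks = chunk_by_section("guidance-12", pages)
citation = format_citation(chunks[-1])
print(citation)
```

Splitting on section boundaries rather than fixed token counts keeps each passage aligned with a citable unit, and the recorded cross-references let a retriever pull in the sections a passage depends on.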

Stages Engaged

Discovery & Blueprint

Concept Validation

Production Build

Total Duration

4 months total

Artifacts Delivered

PRD

RAG Architecture Blueprint

Document Processing Pipeline Spec

WBS

IT Runbook
