Healthcare | Clinical AI | CASE_05

Achieving 99.7% accuracy on clinical note summarisation for regulatory compliance

The client is a US-based clinical-documentation SaaS (engagement 2023–24, deployed Q1 2024) whose previous vendor had plateaued at ~96% accuracy despite months of iteration. The regulator's floor was 99.5%, and the gap from 96% to 99.5%+ is architectural, not a tuning problem: at that level the question is which cases the model is uncertain about, not whether the model is "good." Our multi-gate validation system routes every summary through factual-consistency (against the source note), medical-entity (against a SNOMED slice), completeness, and confidence-scoring checks; the bottom ~0.3% of confidence scores go to a human reviewer queue. The result: 99.7% production accuracy over 6 months and 1.4M summaries (~8.4k/day), with the regulatory submission accepted on first review.

Architect debrief: "At 96% the model is the problem. At 99.5%+ the model is fine, your uncertainty quantification is the problem."

99.7%

Production accuracy

8K+

Notes/day

100%

Audit trail coverage

The Challenge

Clinical note summarisation at 99.5%+ accuracy is a genuinely hard problem. The gap between 96% and 99.5% is not a model-tuning problem; at that level, it is architectural. You need to know which cases the model is uncertain about and handle them differently.

Our Approach

Our architecture introduced a multi-gate validation system. Every summary passes through four sequential checks: factual consistency (token-level alignment against the source note), medical entity validation (every entity mentioned must exist in the source and resolve in a SNOMED slice), completeness (no required section omitted), and confidence scoring. Low-confidence cases route to a human reviewer queue.

The scar: our first cut of the medical-entity gate false-flagged abbreviations that clinicians actually use ("LBP" for low back pain, "SOB" for shortness of breath). In validation week 4 we rebuilt it as a domain-tuned NER + canonicalisation pass that knows the abbreviation conventions of the deployed specialty.

The model was fine-tuned on ~15,000 annotated clinical notes. We also annotated the failures (what a wrong summary looks like), which gave the confidence gate enough signal to catch edge cases.
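The gate sequence described above can be sketched in a few dozen lines of Python. This is an illustration under stated assumptions, not the production system: every name here (`Summary`, `run_gates`, the gate functions), the two-entry abbreviation table, the stand-in terminology set, the numeric-token check used in place of real token-level alignment, the 0.9 threshold, and the choice to send any gate failure to human review are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple

# Illustrative assumptions, not the deployed data.
ABBREVIATIONS = {"lbp": "low back pain", "sob": "shortness of breath"}
TERMINOLOGY = {"low back pain", "shortness of breath"}  # stand-in for a SNOMED slice
REQUIRED_SECTIONS = ("assessment", "plan")              # hypothetical section names

GateResult = Tuple[bool, str]


@dataclass
class Summary:
    source_note: str
    text: str
    entities: List[str]   # entity mentions extracted from the summary
    sections: List[str]   # section names present in the summary
    confidence: float     # model confidence in [0, 1]


def factual_gate(s: Summary) -> GateResult:
    # Crude proxy for token-level alignment: every number in the summary
    # must also appear in the source note.
    source_numbers = set(re.findall(r"\d+(?:\.\d+)?", s.source_note))
    for number in re.findall(r"\d+(?:\.\d+)?", s.text):
        if number not in source_numbers:
            return False, f"number not in source: {number}"
    return True, "ok"


def canonicalise(mention: str) -> str:
    # Expand known abbreviations; otherwise return the lower-cased mention.
    key = mention.strip().lower()
    return ABBREVIATIONS.get(key, key)


def entity_gate(s: Summary) -> GateResult:
    source = s.source_note.lower()
    for mention in s.entities:
        term = canonicalise(mention)
        # Accept the entity if either its raw or canonical form is in the source.
        if mention.lower() not in source and term not in source:
            return False, f"entity not in source: {mention}"
        if term not in TERMINOLOGY:
            return False, f"entity not in terminology: {mention}"
    return True, "ok"


def completeness_gate(s: Summary) -> GateResult:
    missing = [name for name in REQUIRED_SECTIONS if name not in s.sections]
    return (False, f"missing sections: {missing}") if missing else (True, "ok")


def run_gates(s: Summary, threshold: float = 0.9):
    """Run the four gates in order; return (decision, audit_trail)."""
    def confidence_gate(x: Summary) -> GateResult:
        ok = x.confidence >= threshold
        return ok, "ok" if ok else f"confidence {x.confidence} below {threshold}"

    gates = [
        ("factual_consistency", factual_gate),
        ("medical_entity", entity_gate),
        ("completeness", completeness_gate),
        ("confidence", confidence_gate),
    ]
    trail = []
    for name, gate in gates:
        passed, reason = gate(s)
        trail.append((name, passed, reason))  # one audit record per gate run
        if not passed:
            # Assumption: any failed gate sends the summary to human review.
            return "human_review", trail
    return "auto_release", trail


note = "Pt reports LBP for 3 weeks. BP 140/90."
ok = Summary(note, "Low back pain for 3 weeks.", ["LBP"], ["assessment", "plan"], 0.97)
decision, trail = run_gates(ok)
print(decision, [name for name, _, _ in trail])
```

Canonicalising before the terminology lookup is what keeps "LBP" from being flagged: the raw mention is matched against the source note, and the expanded form is matched against the terminology set. The per-gate trail is one simple way to record why a summary was released or escalated.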

Outcome

99.7% production accuracy over 6 months on 1.4M summaries. The human reviewer queue handles ~0.3% of cases. Full audit trail for every summary. Regulatory submission accepted on first review.
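One plausible way to hold the reviewer queue at roughly 0.3% is to set the confidence cutoff from an empirical quantile of recent scores. The case study does not say how the cutoff was chosen, so the sketch below is an assumption; `routing_threshold`, `needs_review`, and the synthetic scores are hypothetical.

```python
from typing import List


def routing_threshold(scores: List[float], review_fraction: float = 0.003) -> float:
    """Return the score at or below which ~review_fraction of scores fall."""
    if not scores:
        raise ValueError("need at least one confidence score")
    ordered = sorted(scores)
    # Always route at least one case so the queue is never empty by construction.
    k = max(1, round(len(ordered) * review_fraction))
    return ordered[k - 1]


def needs_review(score: float, threshold: float) -> bool:
    # Ties at the threshold are routed too, so the queue can slightly exceed the target.
    return score <= threshold


# Synthetic, evenly spaced scores purely for illustration.
scores = [i / 10000 for i in range(1, 10001)]
threshold = routing_threshold(scores)
routed = [s for s in scores if needs_review(s, threshold)]
print(threshold, len(routed))
```

At the volumes reported here, 0.3% of 1.4M summaries is about 4,200 human-reviewed summaries over the six months, or roughly 25 a day at ~8.4k summaries/day.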

What We Learned

At very high accuracy thresholds, the problem shifts to uncertainty quantification.

A well-designed human fallback is how you get from 99% to 99.7%.

Domain expert annotation of failures is as valuable as annotation of successes.

Stages Engaged

Discovery & Blueprint

Concept Validation

Production Build

Total Duration

6 months total

Artifacts Delivered

PRD

Multi-Gate Architecture Spec

Clinical Validation Protocol

WBS

Audit Trail Design
