Auto-generating 10,000 practice questions from curriculum documents at exam-quality standard
A K-12 exam-prep company (engagement late 2023, ~$800k/year prior spend on manual question authoring) had tried wrapping GPT-4 directly. Subject matter experts were rejecting ~60% of the AI-generated questions as "too easy" or "ambiguous": the questions tested recall, not understanding. Our pipeline forces Bloom's-taxonomy-aware generation in three stages: curriculum parsing, cognitive-level-tagged prompt construction, and an automated quality filter trained on the client's own accepted-vs-rejected examples. Expert acceptance reached 91% by the end of Concept Validation, and cost per accepted question fell 74% vs. manual authoring. The pipeline is now used as a first-draft tool: experts refine the top 9% rather than writing from scratch.
10K
Questions/month generated
91%
Accepted by subject experts
−74%
Cost vs. manual authoring
The Challenge
The client had tried GPT-4 directly. The questions were grammatically correct but educationally shallow: they tested recall, not understanding. Subject matter experts were rejecting 60% of AI-generated questions as "too easy" or "ambiguous."
The core problem: LLMs without pedagogical structure produce surface-level questions. The model needed to understand Bloom's taxonomy and generate questions at the right cognitive level for each curriculum objective.
Our Approach
Discovery & Blueprint produced a generation pipeline with three stages: (1) curriculum parsing to extract learning objectives and Bloom's level; (2) structured prompt construction that forces the LLM to generate at a specified cognitive level; (3) an automated quality filter trained on the client's own question bank (accepted vs. rejected examples).
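A minimal sketch of how the three stages could fit together, assuming a simple verb lookup for Bloom's level and treating the LLM call and the quality filter as plain callables. The names `VERB_LEVELS`, `parse_objective`, `build_prompt`, and `run_pipeline` are hypothetical and are not taken from the delivered spec.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Callable, List, Tuple


class BloomLevel(IntEnum):
    REMEMBER = 1
    UNDERSTAND = 2
    APPLY = 3
    ANALYZE = 4
    EVALUATE = 5
    CREATE = 6


# Hypothetical verb-to-level lookup; a real curriculum parser would need
# a far richer mapping than this illustration.
VERB_LEVELS = {
    "define": BloomLevel.REMEMBER,
    "list": BloomLevel.REMEMBER,
    "explain": BloomLevel.UNDERSTAND,
    "solve": BloomLevel.APPLY,
    "compare": BloomLevel.ANALYZE,
    "justify": BloomLevel.EVALUATE,
    "design": BloomLevel.CREATE,
}


@dataclass
class Objective:
    text: str
    level: BloomLevel


def parse_objective(text: str) -> Objective:
    """Stage 1: tag an objective with the highest Bloom level its verbs imply."""
    words = [w.strip(".,;:").lower() for w in text.split()]
    levels = [VERB_LEVELS[w] for w in words if w in VERB_LEVELS]
    # Fall back to the lowest level when no known verb is found.
    return Objective(text=text, level=max(levels, default=BloomLevel.REMEMBER))


def build_prompt(obj: Objective) -> str:
    """Stage 2: state the required cognitive level explicitly in the prompt."""
    return (
        f"Write one exam question for this learning objective: {obj.text}\n"
        f"Target Bloom's level: {obj.level.name.title()}.\n"
        "Do not write a question below this cognitive level."
    )


def run_pipeline(
    objective_texts: List[str],
    generate: Callable[[str], str],  # stand-in for the LLM call
    accept: Callable[[str], bool],   # stand-in for the trained quality filter
) -> List[Tuple[Objective, str]]:
    """Stage 3: keep only the drafts that the quality filter accepts."""
    drafts = []
    for text in objective_texts:
        obj = parse_objective(text)
        question = generate(build_prompt(obj))
        if accept(question):
            drafts.append((obj, question))
    return drafts
```

Passing `generate` and `accept` in as arguments keeps the sketch independent of any particular model or filter, so each stage can be tested on its own.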
Concept Validation ran on 3 subject areas. The expert acceptance rate rose from 40% to 91% by the end of validation.
Outcome
The pipeline generates 10,000 questions per month at a 91% expert acceptance rate. Cost per accepted question fell 74% vs. manual authoring. The pipeline is now used as a first-draft tool: subject matter experts refine the top 9% of rejections rather than writing from scratch.
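For readers who want the metric spelled out, here is a small sketch of how "cost per accepted question" can be computed from volume and acceptance rate. Only the 10,000 questions/month and 91% figures come from this case study; the function names are hypothetical, and any cost passed in is a placeholder, not a figure from the engagement.

```python
def accepted_count(generated: int, acceptance_rate: float) -> int:
    """Number of generated questions that experts accept."""
    return round(generated * acceptance_rate)


def cost_per_accepted(total_cost: float, generated: int, acceptance_rate: float) -> float:
    """Total spend divided by accepted questions, not by generated questions."""
    accepted = accepted_count(generated, acceptance_rate)
    if accepted == 0:
        raise ValueError("no accepted questions; cost per accepted is undefined")
    return total_cost / accepted


# Figures from the case study: 10,000 generated per month at 91% acceptance.
monthly_accepted = accepted_count(10_000, 0.91)
monthly_rejected = 10_000 - monthly_accepted
```

Dividing by accepted rather than generated questions is what makes the acceptance rate matter: rejected drafts still cost money to produce and review.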
What We Learned
01
LLMs need pedagogical scaffolding, not just a prompt. Bloom's taxonomy is the structure.
02
Training the quality filter on your own acceptance data is more effective than manual rubrics.
03
The right goal is "expert-in-the-loop," not "expert replaced."
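To illustrate the second lesson above, here is a minimal sketch of a filter learned from accepted-vs-rejected examples instead of a hand-written rubric. It assumes a bag-of-words naive Bayes classifier purely as a stand-in; the case study does not say which model the actual filter uses, and the class name `AcceptanceFilter` and the sample questions are hypothetical.

```python
import math
from collections import Counter
from typing import Iterable, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip(".,;:?!\"'()").lower() for w in text.split())
    return [w for w in words if w]


class AcceptanceFilter:
    """Naive Bayes over word counts with add-one (Laplace) smoothing."""

    def __init__(self) -> None:
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}
        self.vocab = set()

    def fit(self, examples: Iterable[Tuple[str, bool]]) -> None:
        for text, accepted in examples:
            self.doc_counts[accepted] += 1
            self.word_counts[accepted].update(tokenize(text))
        if 0 in self.doc_counts.values():
            raise ValueError("need at least one accepted and one rejected example")
        self.vocab = set(self.word_counts[True]) | set(self.word_counts[False])

    def _log_score(self, tokens: List[str], label: bool) -> float:
        counts = self.word_counts[label]
        denom = sum(counts.values()) + len(self.vocab)
        total_docs = self.doc_counts[True] + self.doc_counts[False]
        score = math.log(self.doc_counts[label] / total_docs)
        for tok in tokens:
            if tok in self.vocab:  # words never seen in training are ignored
                score += math.log((counts[tok] + 1) / denom)
        return score

    def accept(self, text: str) -> bool:
        tokens = tokenize(text)
        return self._log_score(tokens, True) > self._log_score(tokens, False)


# Hypothetical training data standing in for a client's question bank.
examples = [
    ("Explain why the reaction rate increases with temperature.", True),
    ("Compare the causes of the two revolutions and justify which mattered more.", True),
    ("What is the capital of France?", False),
    ("What year did the war begin?", False),
]
quality_filter = AcceptanceFilter()
quality_filter.fit(examples)
```

A filter like this picks up the experts' actual preferences from their decisions, which is the point of the lesson; a production filter would need far more data and a stronger model than this illustration.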
Stages Engaged
Feasibility Call
Discovery & Blueprint
Concept Validation
Total Duration
3 months total
Artifacts Delivered
PRD
Generation Pipeline Spec
Bloom's Taxonomy Integration Guide
WBS
Quality Filter Training Dataset
Start with a Feasibility Call
2 hours. No cost. We'll tell you honestly whether AI makes sense for your case.