A European tier-1 defence contractor (engagement 2024) whose security classification forbids any AI inference on hardware connected to the internet, every commercial cloud API ruled out on day one. The internal IT team had no prior ML infrastructure experience. Discovery & Blueprint scoped the hardware constraints first: two NVIDIA A6000 GPUs (48GB VRAM each). We selected Llama 3 8B quantised to 4-bit (GGUF) running via llama.cpp; built a simple REST API wrapper, a web interface for analysts, and a deployment package the IT team could install and operate without ML expertise. p95 inference latency 4.2s on the target hardware. IT team achieved independent operation 3 weeks after handoff. No cloud dependency at any point in the pipeline.
Hardware constraints must drive model selection, not the other way around.
Quantisation makes LLMs viable on non-specialist hardware.
IT handoff requires a deployment package, not just documentation.
2 hours. No cost. We'll tell you honestly whether AI makes sense for your case.
Book a call