In-House & Private AI / Private LLM / CASE_12

Running a production LLM on-premise for a defence contractor with no cloud access

A European tier-1 defence contractor (2024 engagement) operates under a security classification that forbids any AI inference on hardware connected to the internet, so every commercial cloud API was ruled out on day one. The internal IT team had no prior ML infrastructure experience. Discovery & Blueprint scoped the hardware constraints first: two NVIDIA A6000 GPUs (48GB VRAM each). We selected Llama 3 8B quantised to 4-bit (GGUF) running via llama.cpp, then built a simple REST API wrapper, a web interface for analysts, and a deployment package the IT team could install and operate without ML expertise. The p95 inference latency was 4.2s on the target hardware. The IT team achieved independent operation within 3 weeks of handoff. There was no cloud dependency at any point in the pipeline.

100%

Air-gapped deployment

4.2s

p95 inference latency

3wk

Handoff to IT independence

The Challenge

The client's security classification required all AI inference to occur on hardware physically located in a secure facility with no internet connectivity. Commercial cloud APIs were completely ruled out. The internal IT team had no prior ML infrastructure experience.

Our Approach

Discovery & Blueprint was the critical phase: we scoped the hardware constraints, model size limits, and inference requirements before recommending any model. The client had two NVIDIA A6000 GPUs (48GB VRAM each). We selected Llama 3 8B quantised to 4-bit (GGUF format), running via llama.cpp on the available hardware. We then built a simple REST API wrapper, a web interface for analysts, and a deployment package that the IT team could install and operate without ML expertise.
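
To illustrate what a "simple REST API wrapper" can look like, here is a minimal sketch using only the Python standard library. It is not the code delivered to the client. The route name `/v1/generate`, the JSON field names, and the `echo_backend` stand-in are all assumptions made for this example; in a real deployment the backend callable would hand the prompt to the locally running model instead of returning a stub string.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from typing import Callable

# Hypothetical backend signature: prompt in, completion text out.
GenerateFn = Callable[[str], str]


def make_handler(generate: GenerateFn):
    """Build a request handler that forwards prompts to `generate`."""

    class Handler(BaseHTTPRequestHandler):
        def do_POST(self):
            if self.path != "/v1/generate":  # hypothetical route name
                self.send_error(404, "unknown endpoint")
                return
            length = int(self.headers.get("Content-Length", 0))
            try:
                payload = json.loads(self.rfile.read(length) or b"{}")
                prompt = payload["prompt"]
            except (ValueError, KeyError, TypeError):
                self.send_error(400, "expected a JSON body with a prompt field")
                return
            body = json.dumps({"completion": generate(prompt)}).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, fmt, *args):
            pass  # keep the sketch quiet; a real service would log requests

    return Handler


def serve(generate: GenerateFn, host: str = "127.0.0.1", port: int = 0):
    """Start the wrapper on a background thread and return the server.

    Binding to the loopback address by default keeps the service
    unreachable from other machines unless the operator opts in.
    """
    server = ThreadingHTTPServer((host, port), make_handler(generate))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def ask(server, prompt: str) -> str:
    """Minimal client, standing in for what a web interface might call."""
    conn = http.client.HTTPConnection(
        "127.0.0.1", server.server_address[1], timeout=10
    )
    try:
        conn.request(
            "POST",
            "/v1/generate",
            body=json.dumps({"prompt": prompt}),
            headers={"Content-Type": "application/json"},
        )
        return json.loads(conn.getresponse().read())["completion"]
    finally:
        conn.close()


def echo_backend(prompt: str) -> str:
    """Stand-in for the local model call; no model is loaded here."""
    return f"[stub completion for: {prompt}]"
```

Keeping the model call behind a plain callable means the HTTP layer can be tested on any machine, without GPUs or model files, before it is connected to the inference runtime on the target hardware.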

Outcome

Fully air-gapped LLM deployment running in production. p95 inference latency: 4.2 seconds per query on the target hardware. IT team achieved independent operation within 3 weeks of handoff. No cloud dependency at any point in the pipeline.
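
For readers unfamiliar with the metric, a p95 latency is the value that 95% of measured queries complete within. The sketch below shows one common way to compute it, the nearest-rank method. The case study does not state which percentile definition or measurement harness was used, so both helper functions here are illustrative assumptions.

```python
import math
import time
from typing import Callable, List


def percentile_nearest_rank(samples: List[float], pct: float) -> float:
    """Return the nearest-rank percentile of `samples` (pct in (0, 100])."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def time_queries(run_query: Callable[[str], object], prompts: List[str]) -> List[float]:
    """Time each call to `run_query` and return the latencies in seconds."""
    latencies = []
    for prompt in prompts:
        start = time.perf_counter()
        run_query(prompt)
        latencies.append(time.perf_counter() - start)
    return latencies
```

With 20 timed queries, for example, the nearest-rank p95 is the 19th fastest measurement, so a single slow outlier does not move the figure but two do.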

What We Learned

Hardware constraints must drive model selection, not the other way around.

Quantisation makes LLMs viable on non-specialist hardware.

IT handoff requires a deployment package, not just documentation.
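
The first two lessons can be made concrete with back-of-envelope arithmetic on weight size. The sketch below is an illustration, not the sizing method used in the engagement. It assumes a nominal 8 billion parameters and exactly 4 bits per weight (real GGUF 4-bit variants store somewhat more than 4 bits per weight on average), and the `reserve_gib` allowance for KV cache and runtime buffers is a hypothetical placeholder.

```python
BYTES_PER_GIB = 2**30


def weight_footprint_gib(n_params: int, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / BYTES_PER_GIB


def headroom_gib(
    n_params: int,
    bits_per_weight: float,
    vram_gib: float,
    reserve_gib: float = 4.0,  # hypothetical allowance, not a measured value
) -> float:
    """VRAM left after the weights and a reserve for cache and buffers."""
    return vram_gib - weight_footprint_gib(n_params, bits_per_weight) - reserve_gib


N_PARAMS = 8_000_000_000  # nominal "8B"; the exact count differs slightly

fp16_gib = weight_footprint_gib(N_PARAMS, 16)  # about 14.9 GiB
q4_gib = weight_footprint_gib(N_PARAMS, 4)     # about 3.7 GiB
spare_gib = headroom_gib(N_PARAMS, 4, vram_gib=48)
```

Under these assumptions, 4-bit quantisation cuts the weight footprint to a quarter of the 16-bit size, which leaves most of a 48GB card free for context and concurrent requests.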

Stages Engaged

Discovery & Blueprint

Concept Validation

Total Duration

2.5 months

Artifacts Delivered

PRD

Hardware Specification

Deployment Package

IT Operations Runbook

Security Architecture Review
