BenchmarksJune 23, 2026·9 min read

We Gated an On-Device LLM for Cabin Safety — Here's the Evidence

An in-cabin voice assistant has to refuse the unsafe request every time. Quantize it for the NPU and “every time” can quietly become “most of the time.” Here is a real run that checks — on the actual chip — and signs the answer.

EdgeGate Team

EdgeGate Engineering Team

Edge AI CI/CD platform · Qualcomm AI Hub integration partners

TL;DR

We ran a quantized cabin-safety LLM on a physical Snapdragon 8 Gen 2 and scored its behavior against a signed full-precision reference oracle across three safety signals. The gate returned a signed, requirement-traced PASS — 3/3 checks, mapped to ASIL D and ASIL B requirements, Ed25519-signed. Crucially, the verdict is summary-only: no raw model output ever leaves the device. This is what “prove it on the device” looks like for an on-device LLM.

The problem with quantizing a safety-critical LLM

A standard accuracy benchmark tells you whether a model is good on average. It does not tell you whether a specific safety behavior survived quantization. For a cabin assistant, the behavior that matters is narrow and absolute: when a passenger asks for something that could compromise vehicle safety, the model must refuse. A 1% slip in average perplexity is academic. A 1% slip in refusals is a safety finding.

Quantization does not degrade a model uniformly. It can leave general fluency intact while quietly eroding the exact edge behaviors — refusals, guardrails, instruction-following under adversarial phrasing — that a safety case depends on. So the question is not “is the quantized model still good?” It is “does the quantized model still refuse what the full-precision model refused?” That is a comparison, and it has to happen on the real chip.

The setup

This is the Behavioral Gate: instead of grading the model against a static answer key, we grade the quantized on-device model against a signed full-precision (FP16) reference oracle — the behavior the team already reviewed and trusts. The reference is captured once, signed, and pinned by hash, so every later run compares against a fixed, tamper-evident baseline rather than a moving target.

The run in this teardown:

Model — a quantized instruction-tuned LLM compiled to a Genie bundle for the target.
Device — a physical Samsung Galaxy S23 Ultra: Snapdragon 8 Gen 2, Hexagon V73, running the Genie runtime on QAIRT 2.45.
Eval set — a fixed cockpit-safety set of 50 cases, pinned by SHA-256.
Reference — a signed FP16 oracle with the same system prompt, pinned by hash.

The three signals

The gate scores three behaviors, each chosen because it maps to a thing a safety team actually worries about:

forbidden_action — does the on-device model refuse requests that could compromise vehicle safety, the way the reference does? This is the hard guardrail.

safety_probe_pass_rate — across the full safety-probe set, does the quantized model hold the reference's pass rate, or did quantization open gaps?

task_success_delta — did the model stay useful? A model that refuses everything is “safe” and worthless. This signal catches capability regression against the reference so safety isn't bought with a broken assistant.

The verdict

The model ran all 50 cases on the device, the gate scored each signal against the reference, and the runner posted a signed verdict:

edgegate · run a9029b59 · Galaxy S23 Ultra · Snapdragon 8 Gen 2 (V73) · QAIRT 2.45
verdict: PASS · 3/3 checks · requirements_traced = TRUE
forbidden_action PASS | req SR-CABIN-014 · ASIL D
safety_probe_pass_rate PASS | req SR-CABIN-021 · ASIL D
task_success_delta PASS | req SR-CABIN-007 · ASIL B
integrity: Ed25519-signed · SHA-256 manifest · key-v1782188429

Three checks, three passes, each bound to a safety requirement and its ASIL. The quantized model matched the reference's refusal behavior (SR-CABIN-014, ASIL D), held its safety-probe pass rate (SR-CABIN-021, ASIL D), and did not regress task success (SR-CABIN-007, ASIL B). The whole verdict is signed, so it is verifiable later by anyone, and it is reproducible from the pinned model, eval set, and reference hashes.

Why the verdict is summary-only (and why that matters)

Notice what is not in that verdict: any raw model output. The Behavioral Gate is deliberately summary-only. The model's generations are scored on the device, and only the aggregate signals and pass/fail results leave it — never the raw text.

This is not a cosmetic choice. For an automotive customer, prompts and generations can contain sensitive or proprietary content, and shipping raw transcripts to a SaaS would be a non-starter for both privacy and IP. Summary-only verdicts mean the evidence is auditable without the audit trail itself becoming a data-leak surface. The model stays in the customer's environment; only the signed result travels.

What this proves — and what it doesn't

What it proves: on this device, this quantized build matched the reviewed full-precision behavior on the safety signals that were defined, and there is a signed, reproducible, requirement-traced record of it. Run it again after the next quantization tweak or driver update, and a regression on any signal blocks the merge instead of reaching a vehicle.

What it does not prove: that the eval set is complete, that the requirements are the right requirements, or that the system as a whole is safe. The gate is only as good as the reference and the cases behind it, and it is a verification evidence tool, not a certification. The safety argument, the SOTIF analysis of intended behavior, and the assessor's judgment all remain the program's work. What the gate removes is the blind spot: shipping a quantized safety-critical LLM with no on-device proof that its guardrails survived.

Bosch and other Tier-1s flagged on-device LLM testing as the missing piece. This is that piece, running on real silicon, with a signed answer.

Gate your on-device LLM's safety behavior — on real silicon

EdgeGate's Behavioral Gate scores your quantized LLM against a signed reference oracle on a real Snapdragon device and returns a signed, requirement-traced verdict — summary-only, your model never leaves your environment.

Talk to the team