EdgeGate is a hardware-in-the-loop CI/CD platform that runs automated regression tests on real Snapdragon devices through Qualcomm AI Hub. It provides deterministic CI gating with signed evidence bundles for edge AI deployments.

How does EdgeGate test on real devices?

EdgeGate orchestrates test runs on physical Snapdragon chipsets via Qualcomm AI Hub. Your model is compiled and profiled on real hardware, capturing actual latency, accuracy, and thermal behavior that emulators cannot reproduce.

What CI/CD systems does EdgeGate integrate with?

EdgeGate integrates with GitHub Actions and any CI/CD system that supports webhooks. A single YAML file configures the integration, and results appear as PR checks that can block merges on regression.

What are signed evidence bundles?

Every EdgeGate test run produces a cryptographically signed evidence bundle containing SHA-256 hashes and Ed25519 signatures. This provides an auditable proof that your AI model was validated on real hardware — useful for team reviews and regulatory compliance.

Is EdgeGate free to use?

Yes. The Playground plan is free and includes 10 runs per month on real Snapdragon devices. Paid plans (Pro at $49/month, Team at $149/month) offer more runs, additional features like flake detection, RBAC, and API access.

Do I need my own Qualcomm AI Hub API key?

The free Playground plan includes hosted device access. All paid plans require your own Qualcomm AI Hub API key, which gives you access to the full fleet of 50+ Snapdragon devices.

We Gated an On-Device LLM for Cabin Safety — Here's the Evidence

TL;DR

We ran a quantized cabin-safety LLM on a physical Snapdragon 8 Gen 2 and scored its behavior against a signed full-precision reference oracle across three safety signals. The gate returned a signed, requirement-traced PASS — 3/3 checks, mapped to ASIL D and ASIL B requirements, Ed25519-signed. Crucially, the verdict is summary-only: no raw model output ever leaves the device. This is what “prove it on the device” looks like for an on-device LLM.

The problem with quantizing a safety-critical LLM

A standard accuracy benchmark tells you whether a model is good on average. It does not tell you whether a specific safety behavior survived quantization. For a cabin assistant, the behavior that matters is narrow and absolute: when a passenger asks for something that could compromise vehicle safety, the model must refuse. A 1% slip in average perplexity is academic. A 1% slip in refusals is a safety finding.

Quantization does not degrade a model uniformly. It can leave general fluency intact while quietly eroding the exact edge behaviors — refusals, guardrails, instruction-following under adversarial phrasing — that a safety case depends on. So the question is not “is the quantized model still good?” It is “does the quantized model still refuse what the full-precision model refused?” That is a comparison, and it has to happen on the real chip.

The setup

This is the Behavioral Gate: instead of grading the model against a static answer key, we grade the quantized on-device model against a signed full-precision (FP16) reference oracle — the behavior the team already reviewed and trusts. The reference is captured once, signed, and pinned by hash, so every later run compares against a fixed, tamper-evident baseline rather than a moving target.

The run in this teardown:

Model — a quantized instruction-tuned LLM compiled to a Genie bundle for the target.
Device — a physical Samsung Galaxy S23 Ultra: Snapdragon 8 Gen 2, Hexagon V73, running the Genie runtime on QAIRT 2.45.
Eval set — a fixed cockpit-safety set of 50 cases, pinned by SHA-256.
Reference — a signed FP16 oracle with the same system prompt, pinned by hash.

The three signals

The gate scores three behaviors, each chosen because it maps to a thing a safety team actually worries about:

forbidden_action — does the on-device model refuse requests that could compromise vehicle safety, the way the reference does? This is the hard guardrail.

safety_probe_pass_rate — across the full safety-probe set, does the quantized model hold the reference's pass rate, or did quantization open gaps?

task_success_delta — did the model stay useful? A model that refuses everything is “safe” and worthless. This signal catches capability regression against the reference so safety isn't bought with a broken assistant.

The verdict

The model ran all 50 cases on the device, the gate scored each signal against the reference, and the runner posted a signed verdict:

edgegate · run a9029b59 · Galaxy S23 Ultra · Snapdragon 8 Gen 2 (V73) · QAIRT 2.45

verdict: PASS · 3/3 checks · requirements_traced = TRUE

forbidden_action PASS | req SR-CABIN-014 · ASIL D

safety_probe_pass_rate PASS | req SR-CABIN-021 · ASIL D

task_success_delta PASS | req SR-CABIN-007 · ASIL B

integrity: Ed25519-signed · SHA-256 manifest · key-v1782188429

Three checks, three passes, each bound to a safety requirement and its ASIL. The quantized model matched the reference's refusal behavior (SR-CABIN-014, ASIL D), held its safety-probe pass rate (SR-CABIN-021, ASIL D), and did not regress task success (SR-CABIN-007, ASIL B). The whole verdict is signed, so it is verifiable later by anyone, and it is reproducible from the pinned model, eval set, and reference hashes.

Why the verdict is summary-only (and why that matters)

Notice what is not in that verdict: any raw model output. The Behavioral Gate is deliberately summary-only. The model's generations are scored on the device, and only the aggregate signals and pass/fail results leave it — never the raw text.

This is not a cosmetic choice. For an automotive customer, prompts and generations can contain sensitive or proprietary content, and shipping raw transcripts to a SaaS would be a non-starter for both privacy and IP. Summary-only verdicts mean the evidence is auditable without the audit trail itself becoming a data-leak surface. The model stays in the customer's environment; only the signed result travels.

What this proves — and what it doesn't

What it proves: on this device, this quantized build matched the reviewed full-precision behavior on the safety signals that were defined, and there is a signed, reproducible, requirement-traced record of it. Run it again after the next quantization tweak or driver update, and a regression on any signal blocks the merge instead of reaching a vehicle.

What it does not prove: that the eval set is complete, that the requirements are the right requirements, or that the system as a whole is safe. The gate is only as good as the reference and the cases behind it, and it is a verification evidence tool, not a certification. The safety argument, the SOTIF analysis of intended behavior, and the assessor's judgment all remain the program's work. What the gate removes is the blind spot: shipping a quantized safety-critical LLM with no on-device proof that its guardrails survived.

Bosch and other Tier-1s flagged on-device LLM testing as the missing piece. This is that piece, running on real silicon, with a signed answer.

Gate your on-device LLM's safety behavior — on real silicon

EdgeGate's Behavioral Gate scores your quantized LLM against a signed reference oracle on a real Snapdragon device and returns a signed, requirement-traced verdict — summary-only, your model never leaves your environment.

Talk to the team

A Passing Test Is Not Evidence: The Silent Regression Problem in On-Device AI

Why a green checkmark cannot carry the weight an assessor puts on it.

ISO 26262 Evidence for On-Device AI: A Practical Clause-by-Clause Mapping

How a signed evidence bundle maps to ISO 26262 Part 6 and Part 8.

100 Inference Runs on Snapdragon: What the Data Shows

Why median-of-N gating catches what single-run testing misses.