TL;DR
A passing test tells you a check ran and returned green. It does not tell you which model ran, on which silicon, against which inputs, or whether the result can be reproduced and verified later. For cloud software that gap is tolerable. For AI running on a physical device in a car, a medical instrument, or a drone, it is the whole problem. The fix is to stop treating the dashboard as the deliverable and start treating signed, reproducible evidence as the deliverable.
The silent regression
You train and validate a model in the cloud at full precision. It is accurate, it behaves, and it passes review. Then it is quantized to INT8 or INT4 to fit the NPU's memory and power budget, compiled for the target chipset, and flashed onto the device.
Somewhere in that pipeline, the model can quietly change behavior. Quantization error accumulates in normalization layers. An unsupported operator silently falls back from the NPU to the CPU, quadrupling power draw. A driver or firmware update shifts numerics on the same binary you shipped last month. None of this announces itself. There is no crash, no exception, no red line in a log. The model just gets a little worse at the worst possible time, in the field, where reproduction needs the exact device and the exact conditions.
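To make the quantization point concrete, here is a minimal sketch. It assumes symmetric per-tensor INT8 quantization with a single scale, which is a simplification; real toolchains often use per-channel scales and calibration data. The function names are illustrative and not taken from any SDK. It shows how one large outlier stretches the scale until small values lose their resolution, with no error raised anywhere.

```python
def quantize_int8(values):
    """Symmetric per-tensor INT8: map [-max_abs, max_abs] onto [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the INT8 range.
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map integer steps back to floating point."""
    return [x * scale for x in q]


def max_abs_error(values):
    """Largest round-trip error after quantizing and dequantizing."""
    q, scale = quantize_int8(values)
    restored = dequantize(q, scale)
    return max(abs(a - b) for a, b in zip(values, restored))


# Hypothetical activations: small values, then the same values plus one outlier.
well_scaled = [0.01, -0.02, 0.03, 0.015]
with_outlier = well_scaled + [12.0]

err_small = max_abs_error(well_scaled)
err_outlier = max_abs_error(with_outlier)
print(f"max error without outlier: {err_small:.6f}")
print(f"max error with outlier:    {err_outlier:.6f}")
```

With the outlier present, every small value rounds to zero. The code runs cleanly and the numbers are simply wrong, which is the shape of the problem described above.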
This is the silent regression, and the reason it survives is that the thing most teams rely on to catch it — a passing test in CI — was never designed to prove the thing they actually need proven.
“Passing” and “proven” are different claims
A green checkmark makes a narrow claim: a process ran in some environment and exited zero. That is genuinely useful. It is also not the claim a release engineer, an auditor, or a functional-safety assessor is asking you to make.
They are asking a different question: can you show, after the fact, exactly what was tested, on what hardware, with what inputs, and that the result has not been altered since? A checkmark cannot carry that weight. It has no identity (which model artifact, by hash?), no provenance (which physical device and SDK version?), no integrity (can anyone confirm the numbers were not edited in a spreadsheet afterward?), and no reproducibility (could a third party re-run it and land in the same place?).
For a web service, you paper over this with rollbacks. If something regresses, you redeploy a container in seconds and move on. On-device AI has no such luxury. The blast radius is physical, the rollback is an over-the-air campaign measured in days, and in regulated industries the evidence trail is not optional — it is the gate to shipping at all.
What evidence actually requires
If a dashboard screenshot is not evidence, what is? Four properties, and a passing test supplies none of them by default:
Identity. The exact model is named by a content hash, not a branch or a tag. “v0.5” is a label someone can move. A SHA-256 of the actual weights is not.
Provenance. The result is bound to the real silicon it ran on, the runtime, and the SDK and driver versions — pinned, not assumed. An emulator result is a different claim than a result from a Snapdragon 8 Gen 2 with a known QAIRT version.
Integrity. The result is signed, so anyone can verify it has not been altered since the run. A cryptographic signature over the metrics and manifest turns “trust me” into “check it yourself.”
Traceability. Each check maps back to the requirement it satisfies. “Latency 8.2 ms” is a number. “Latency 8.2 ms, satisfying requirement SR-CABIN-014, ASIL D” is evidence an assessor can place in a safety case.
Bundle those four together and you have something durable: a signed evidence artifact that outlives the CI run, the Slack thread, and the engineer who remembers how it was tested.
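As a rough sketch of what such a bundle could look like, the snippet below hashes the model bytes, records the device and SDK, attaches a requirement-traced result, and signs the manifest. Everything here is an assumption for illustration: the field names, the `build_bundle` helper, the 10.0 ms limit, and the placeholder SDK string are hypothetical, not EdgeGate's format. It also uses HMAC from the Python standard library purely to stay self-contained. HMAC needs a shared secret, so a bundle that anyone can verify would use an asymmetric signature (for example Ed25519) instead.

```python
import hashlib
import hmac
import json


def canonical(obj) -> bytes:
    """Serialize with sorted keys so the same manifest always yields the same bytes."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


def build_bundle(model_bytes, device, sdk_version, results, key):
    manifest = {
        "model_sha256": hashlib.sha256(model_bytes).hexdigest(),  # identity
        "device": device,                                         # provenance
        "sdk_version": sdk_version,                               # provenance
        "results": results,                                       # traceability
    }
    # Integrity: a keyed digest over the canonical manifest bytes.
    signature = hmac.new(key, canonical(manifest), hashlib.sha256).hexdigest()
    return {"manifest": manifest, "signature": signature}


key = b"demo-key-not-for-production"  # hypothetical; never hard-code real keys
bundle = build_bundle(
    model_bytes=b"placeholder model weights",
    device="Snapdragon 8 Gen 2",
    sdk_version="QAIRT <pinned version>",  # placeholder, pin the real version
    results=[{
        "requirement": "SR-CABIN-014",
        "asil": "D",
        "metric": "latency_ms",
        "value": 8.2,
        "limit": 10.0,  # hypothetical threshold
        "passed": True,
    }],
    key=key,
)
print(json.dumps(bundle, indent=2))
```

Because the signature covers the canonical manifest, editing any number afterward changes the bytes being signed and the stored signature no longer matches.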
The assessor test
Here is a simple way to know whether you have evidence or just a green build. Imagine a functional-safety assessor sits down six months from now and asks you to demonstrate that the model in the field is the one you verified, and that it met its safety requirements on the actual target hardware.
If your answer is “the CI run was green,” you do not have evidence. If your answer is a signed bundle — model hash, device, SDK version, per-requirement pass/fail with ASIL, and a signature anyone can verify — you do. The gap between those two answers is the gap between a testing tool and an evidence tool, and it is the gap that decides whether an automotive or medical program can ship at all.
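The assessor's question can be expressed as three checks: is the bundle unaltered, is the model in the field the one that was verified, and did every traced requirement pass? The sketch below is one way to write that down, under stated assumptions. The bundle layout, the `assess` function, and the sample values are hypothetical, and HMAC stands in for a real asymmetric signature only so the example runs with the standard library.

```python
import hashlib
import hmac
import json


def sign(manifest, key):
    """Keyed digest over a canonical serialization of the manifest."""
    payload = json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def assess(bundle, field_model_bytes, key):
    """Return a list of problems; an empty list means the bundle supports the claim."""
    problems = []
    manifest = bundle["manifest"]
    if not hmac.compare_digest(sign(manifest, key), bundle["signature"]):
        problems.append("signature mismatch")
    if hashlib.sha256(field_model_bytes).hexdigest() != manifest["model_sha256"]:
        problems.append("field model is not the verified model")
    for result in manifest["results"]:
        if not result["passed"]:
            problems.append(f"requirement {result['requirement']} (ASIL {result['asil']}) failed")
    return problems


# Hypothetical bundle produced at verification time.
key = b"demo-key-not-for-production"
verified_model = b"weights as verified"
manifest = {
    "model_sha256": hashlib.sha256(verified_model).hexdigest(),
    "device": "Snapdragon 8 Gen 2",
    "results": [{"requirement": "SR-CABIN-014", "asil": "D", "passed": True}],
}
bundle = {"manifest": manifest, "signature": sign(manifest, key)}

print(assess(bundle, verified_model, key))               # same artifact as verified
print(assess(bundle, b"weights rebuilt later", key))     # a different artifact in the field
```

A green CI run cannot answer any of these three checks after the fact. A bundle can, as long as the key and the field artifact are available.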
Note the honest boundary here: evidence is not certification. A signed bundle does not make you compliant or replace the assessor's judgment. It produces the verifiable, reproducible, requirement-traced material the assessment depends on. That distinction matters, and overclaiming is how tools lose credibility with the exact buyers who care most.
Putting it in the pipeline
The good news is that evidence does not require a new workflow. It rides the one you already have. A model change opens a pull request. A gate runs the model on real silicon — not an emulator, the actual chip — and measures what matters on device: latency, peak memory, the NPU/GPU/CPU compute split, accuracy parity against the full-precision reference, and, for on-device LLMs, safety behavior against a signed reference oracle.
If a gate fails, the PR does not merge. If it passes, the run emits a signed evidence bundle and attaches it to the result. The engineer's experience is unchanged — push, wait for the check — but the artifact left behind is now something an auditor accepts, not just something a dashboard displays.
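The gate logic itself can be small. The following is a minimal sketch, assuming the device run has already produced the measurements. The `Measurement` and `Limits` types, the field names, and every threshold are hypothetical, chosen only to illustrate the pass/fail decision and the non-zero exit code that blocks a merge in most CI systems.

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    latency_ms: float
    peak_memory_mb: float
    npu_fraction: float    # share of operators executed on the NPU, 0.0 to 1.0
    accuracy_delta: float  # full-precision accuracy minus on-device accuracy


@dataclass
class Limits:
    max_latency_ms: float
    max_peak_memory_mb: float
    min_npu_fraction: float
    max_accuracy_delta: float


def evaluate_gate(m, limits):
    """Return a list of human-readable failures; empty means the gate passes."""
    failures = []
    if m.latency_ms > limits.max_latency_ms:
        failures.append(f"latency {m.latency_ms} ms exceeds {limits.max_latency_ms} ms")
    if m.peak_memory_mb > limits.max_peak_memory_mb:
        failures.append(f"peak memory {m.peak_memory_mb} MB exceeds {limits.max_peak_memory_mb} MB")
    if m.npu_fraction < limits.min_npu_fraction:
        failures.append(f"npu share {m.npu_fraction:.2f} below {limits.min_npu_fraction:.2f}")
    if m.accuracy_delta > limits.max_accuracy_delta:
        failures.append(f"accuracy drop {m.accuracy_delta} exceeds {limits.max_accuracy_delta}")
    return failures


def exit_code(failures):
    """Non-zero exit fails the CI check, which keeps the PR from merging."""
    return 1 if failures else 0


limits = Limits(max_latency_ms=10.0, max_peak_memory_mb=512.0,
                min_npu_fraction=0.95, max_accuracy_delta=0.01)
healthy = Measurement(latency_ms=8.2, peak_memory_mb=412.0,
                      npu_fraction=1.0, accuracy_delta=0.004)
fallback = Measurement(latency_ms=8.2, peak_memory_mb=412.0,
                       npu_fraction=0.60, accuracy_delta=0.004)

for name, m in [("healthy", healthy), ("fallback", fallback)]:
    failures = evaluate_gate(m, limits)
    print(name, "exit", exit_code(failures), failures)
```

Note that the `fallback` run has the same latency as the healthy one. A latency-only check would pass it, which is why the compute split is measured as its own assertion.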
The shift is small in workflow terms and large in consequence. You stop asking “did the test pass?” and start asking “can we prove what shipped?” For software that lives on a screen, the first question is enough. For AI that lives on a device in the physical world, only the second one is.
Prove what shipped — to an assessor, not just a dashboard
EdgeGate gates your models on real Snapdragon devices in every PR and signs an evidence bundle for every result — model hash, device, SDK, and per-requirement traceability. Free tier includes 10 runs/month.