TL;DR
Edge AI inference varies 30–40% between runs due to thermal state, memory pressure, and OS scheduling. Single-run and mean-based testing produce unreliable gates. Median-of-N gating (with warmup exclusion) gives you deterministic pass/fail decisions from non-deterministic hardware. Use N=5 for dev, N=11 for PR gates, N=21 for release qualification. Combine with CV tracking and flake detection for production-grade reliability.
Why Does Edge AI Inference Vary Between Runs?
Software tests are binary: the function returns the correct value or it doesn't. Edge AI testing doesn't have that luxury. Inference latency on a Snapdragon NPU varies between runs due to thermal state, memory pressure, OS scheduling, and firmware-level optimizations that behave differently under load.
Run a model 10 times on a Qualcomm RB5 and you might see latencies ranging from 42ms to 58ms. That's a 38% spread. If your quality gate is "latency must be under 50ms," this model passes 7 out of 10 times and fails 3. Is it a regression or just noise?
This is the core challenge of hardware-in-the-loop testing: you need deterministic pass/fail decisions from non-deterministic measurements.
Why Do Simple Testing Approaches Fail?
Single-Run Testing
The simplest approach — run once, check the number — is the least reliable. A single measurement is dominated by noise. Your gate will randomly pass and fail on the same model, eroding developer trust in the CI system. After a few false failures, engineers start ignoring the gate entirely.
Mean-of-N Testing
Averaging multiple runs is better, but means are sensitive to outliers. A single 200ms spike (caused by, say, a background garbage collection event on the device) can pull the average above your threshold even when typical performance is well within bounds.
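A quick sketch makes the outlier problem concrete. The latency values below are hypothetical, but they show how one 200ms spike drags the mean across a 50ms gate while the median barely notices:

```python
import statistics

# Hypothetical latencies (ms): four typical runs plus one 200 ms spike
# caused by, e.g., a background GC event on the device
runs = [44.0, 46.0, 45.0, 200.0, 47.0]

mean = statistics.mean(runs)      # pulled to 76.4 ms by the spike -> fails a 50 ms gate
median = statistics.median(runs)  # stays at 46.0 ms -> passes

print(f"mean={mean:.1f}ms median={median:.1f}ms")
```

One bad run out of five is enough to fail a mean-based gate by more than 50%, even though typical performance never left the 44–47ms band.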
Min/Max Testing
Gating on the minimum run is overly optimistic — it tells you the best case, not the typical case. Gating on the maximum is overly pessimistic — it fails on transient outliers that don't reflect real-world behavior.
How Does Median-of-N Gating Work?
Median-of-N gating solves these problems by combining statistical robustness with practical simplicity. Here's how it works:
Step 1: Warmup exclusion. The first K runs on a cold device are discarded. Cold-start latency is real but not representative of sustained performance. Typical warmup: 2–3 runs, depending on the model and device.
Step 2: Repeated measurement. Run inference N times (after warmup). N should be odd for a clean median. Common values: N=5 for fast iteration, N=11 for high-confidence gating, N=21 for release qualification.
Step 3: Take the median. The median is the middle value when sorted. It naturally ignores outliers on both ends. A single 200ms spike doesn't affect it. A single lucky 30ms run doesn't affect it either. You get the "typical" performance.
Step 4: Gate on the median. Compare the median against your threshold. This is your deterministic pass/fail decision derived from non-deterministic measurements.
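The four steps above fit in a few lines. This is a minimal sketch, not a production harness: `measure` stands in for whatever callable returns one latency sample from your device, and the function name and signature are illustrative.

```python
import statistics
from typing import Callable

def median_gate(measure: Callable[[], float],
                threshold_ms: float,
                n: int = 11,
                warmup: int = 3) -> bool:
    """Median-of-N gate with warmup exclusion.

    `measure` is any callable returning one latency sample in ms
    (hypothetical hook -- adapt to your own test harness).
    """
    for _ in range(warmup):                   # Step 1: discard cold-start runs
        measure()
    samples = [measure() for _ in range(n)]   # Step 2: N measured runs (N odd)
    med = statistics.median(samples)          # Step 3: robust central value
    return med <= threshold_ms                # Step 4: deterministic verdict
```

Note that `n` should be odd so the median is an actual observed sample rather than an interpolated value between two runs.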
How Do You Handle Flaky Test Results?
Even with median-of-N, some models land right on the boundary of a threshold. A model with a true median latency of 49.5ms against a 50ms gate will flip between pass and fail across different CI runs. This is the flake problem.
There are two complementary strategies:
Coefficient of Variation (CV) Tracking
Calculate the coefficient of variation (standard deviation divided by mean) across your N measurements. If the CV exceeds a threshold — say, 15% — flag the result as "high variance" regardless of whether the median passes. High variance means the measurement isn't trustworthy and the gate should be re-run or investigated.
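The CV check is a one-liner on top of the same samples. Sample values and the 15% cutoff below are illustrative:

```python
import statistics

def coefficient_of_variation(samples):
    """CV = sample standard deviation / mean."""
    return statistics.stdev(samples) / statistics.mean(samples)

# Hypothetical run: the 70 ms outlier makes the whole measurement suspect,
# even though the median (45 ms) would pass a 50 ms gate
samples = [42.0, 44.0, 45.0, 46.0, 70.0]
needs_rerun = coefficient_of_variation(samples) > 0.15  # 15% cutoff
```

The point is that CV and the median gate are independent signals: a passing median with a high CV should still block merge (or trigger a re-run), because the number you gated on wasn't stable.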
Flake Detection with Historical Context
Track gate results over time. If the same model on the same device alternates between pass and fail across consecutive runs, it's a flake — not a real regression. A robust system detects this pattern and either auto-retries with a higher N or alerts the team that the model is operating too close to the threshold.
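One simple way to detect the alternating pattern is to count pass/fail transitions in a recent window. This heuristic (and its window/flip counts) is a hypothetical sketch, not a standard algorithm:

```python
def is_flaky(history, window=6, flips=3):
    """Flag a gate as flaky if its verdict flips frequently.

    `history` is a list of booleans (pass=True), most recent last.
    Window size and flip threshold are illustrative tuning knobs.
    """
    recent = history[-window:]
    transitions = sum(a != b for a, b in zip(recent, recent[1:]))
    return transitions >= flips
```

A model alternating pass/fail (`[True, False, True, False, ...]`) trips this immediately; a genuine regression shows up as a one-time transition followed by consistent failures, which does not.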
How Many Measurement Runs Should You Use?
The number of measurement runs is a tradeoff between confidence and CI cycle time. Here's a practical framework:
N=5 (fast feedback) — Good for development branches where you want quick iteration. Catches large regressions (20%+ degradation) reliably. Misses small regressions near the threshold.
N=11 (standard CI) — The sweet spot for most pull request gates. Statistically robust for regressions above 10%. Completes in a reasonable time for most models.
N=21 (release qualification) — Use for production release gates where you need high confidence. Detects regressions as small as 5%. Worth the extra time for builds that ship to customers.
The key insight: your PR gate and your release gate don't need the same N. Use a lower N for fast developer feedback and a higher N for the final quality check before deployment.
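In practice this framework reduces to a small per-tier config. The structure below is a sketch with hypothetical key names; the N and warmup values come from the guidance above:

```python
# Hypothetical per-tier gating config
GATE_TIERS = {
    "dev":     {"n": 5,  "warmup": 2},  # fast feedback; catches 20%+ regressions
    "pr":      {"n": 11, "warmup": 3},  # standard CI; robust above ~10%
    "release": {"n": 21, "warmup": 3},  # qualification; detects ~5% regressions
}

def runs_for(tier: str) -> int:
    return GATE_TIERS[tier]["n"]
```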
Can You Apply Median-of-N Beyond Latency?
Median-of-N isn't just for latency. The same approach works for any metric with run-to-run variance:
Memory (peak RSS) — memory usage can vary slightly between runs due to allocation patterns. Median smooths this out.
Throughput (inferences/second) — sustained throughput is affected by thermal state and scheduling. Median after warmup gives you the steady-state number.
Accuracy on device — if you're evaluating accuracy with a test suite that includes non-deterministic preprocessing (e.g., random crops, augmentations), median-of-N prevents a single unlucky batch from failing the gate.
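The only change needed for these other metrics is gate direction: for latency and peak RSS, lower is better; for throughput and accuracy, higher is better. A generic sketch (names illustrative):

```python
import statistics

def metric_gate(samples, threshold, higher_is_better=False):
    """Median-of-N gate for any metric: latency, peak RSS, throughput, accuracy."""
    med = statistics.median(samples)
    return med >= threshold if higher_is_better else med <= threshold
```

So `metric_gate(latencies_ms, 50.0)` and `metric_gate(throughputs, 100.0, higher_is_better=True)` share one implementation.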
What Does a Production Implementation Look Like?
A well-implemented gating system records every individual measurement, not just the final median. This gives you:
Trend analysis — even if the median stays below threshold, a gradual upward trend in raw measurements signals a slow regression that will eventually cross the line.
Variance tracking — sudden increases in variance (even with a passing median) can indicate instability in the model or device firmware that warrants investigation.
Audit trail — for regulated industries, you need proof of what was measured, how many times, and what the individual results were. A single number isn't sufficient for compliance.
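A record that satisfies all three needs keeps every raw sample next to the verdict. The schema below is a hypothetical sketch; field names are illustrative:

```python
import json
import statistics
import time

def gate_record(model, device, metric, samples, threshold):
    """Build an auditable gate result: every raw sample plus the median verdict."""
    med = statistics.median(samples)
    return {
        "model": model,
        "device": device,
        "metric": metric,
        "timestamp": time.time(),
        "samples": samples,        # every individual measurement, not just the median
        "median": med,
        "threshold": threshold,
        "passed": med <= threshold,
    }

record = gate_record("detector-v3", "rb5-01", "latency_ms",
                     [44.0, 46.0, 45.0, 47.0, 46.5], 50.0)
print(json.dumps(record))
```

Storing the raw `samples` list is what makes trend analysis and variance tracking possible later; the median alone throws that information away.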
The goal isn't perfect measurements — it's reliable decisions. Median-of-N with warmup exclusion and flake detection gives you decisions you can trust, on hardware that doesn't always give you the same answer twice.
Median-of-N gating, built in
EdgeGate handles warmup exclusion, median-of-N measurement, CV tracking, and flake detection automatically. Configure your thresholds and let the system handle the statistics.