TL;DR
100 runs of MobileNetV2 on Snapdragon 8 Gen 3 showed 83% latency spread (0.358–0.665 ms), a 7.3x cold-start penalty, and 1.5% mean-vs-median skew from outlier spikes. ResNet50 exceeded both our inference (1.403 ms) and memory (236.6 MB) gates — caught automatically. Every result is Ed25519-signed and SHA-256 hashed.
Experiment Setup
We wanted to answer a simple question: how much does inference latency vary when you run the same model on the same device repeatedly?
To find out, we used two widely known image-classification models at different complexity levels:
| Model | Parameters | ONNX Size | Source |
|---|---|---|---|
| MobileNetV2 | 3.5M | 13.3 MB | PyTorch torchvision (ImageNet pretrained) |
| ResNet50 | 25.6M | 97.4 MB | PyTorch torchvision (ImageNet pretrained) |
Both models were exported to ONNX (opset 13), compiled for Samsung Galaxy S24 (Snapdragon 8 Gen 3 / SM8650) through Qualcomm AI Hub, and profiled 100 times on real hardware. No emulators. No simulators. Real silicon.
MobileNetV2: 100 Runs, 83% Spread
| Metric | Value |
|---|---|
| Total runs | 100 (2 warmup excluded → 98 valid) |
| Median (post-warmup) | 0.369 ms |
| Mean (post-warmup) | 0.375 ms (1.5% higher than median) |
| Std Deviation | 0.031 ms |
| Min | 0.358 ms |
| Max | 0.665 ms (1.8x median — outlier spike) |
| Spread ((max − min) / median) | 83.2% |
| Cold-start (run 1) | 2.689 ms (7.3x slower than median) |
| Coefficient of Variation | 8.3% (below 15% flaky threshold) |
| Gate: inference_time_ms ≤ 1.0 ms | PASSED |
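The statistics in the table above can be reproduced with a few lines of stdlib Python. The `summarize` helper and the timing values below are hypothetical illustrations, not EdgeGate's implementation or the raw run data:

```python
import statistics

def summarize(times_ms, warmup=2):
    """Drop warmup runs, then compute the summary statistics used above."""
    valid = times_ms[warmup:]          # exclude cold-start runs
    med = statistics.median(valid)
    return {
        "median_ms": med,
        "mean_ms": statistics.fmean(valid),
        "cv_pct": 100 * statistics.stdev(valid) / statistics.fmean(valid),
        "spread_pct": 100 * (max(valid) - min(valid)) / med,
    }

# Hypothetical run: one cold start, mostly steady state, one spike.
runs = [2.689, 0.428, 0.365, 0.370, 0.368, 0.372, 0.665, 0.366, 0.371, 0.369]
stats = summarize(runs)
```

Applying the same spread formula, (max − min) / median, to the table's values reproduces the 83.2% figure: (0.665 − 0.358) / 0.369.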
What This Tells Us
The cold-start effect is dramatic. Run 1 came in at 2.689 ms — over 7x slower than the median. Run 2 was 0.428 ms. By run 3, the device had settled into steady state. This is why warmup exclusion isn't optional: without it, your “benchmark” is dominated by cache-miss initialization overhead that doesn't reflect production behavior.
Outlier spikes happen in steady state too. Run 12 spiked to 0.665 ms — 80% above the median. This is likely OS scheduling contention or thermal throttling. If you benchmark with a single run and happen to hit this spike, you'll report a misleading number.
Mean is inflated by spikes. The mean (0.375 ms) is 1.5% higher than the median (0.369 ms). With a small, fast model like MobileNetV2 this gap is modest. With larger models or under thermal stress, the gap can be 5–15%. The median is the robust choice.
ResNet50: Gates Catch a Real Regression
| Metric | Value |
|---|---|
| Total runs | 100 (2 warmup excluded → 98 valid) |
| Median (post-warmup) | 1.403 ms |
| Mean (post-warmup) | 1.413 ms (0.7% higher than median) |
| Std Deviation | 0.041 ms |
| Min / Max | 1.376 ms / 1.711 ms |
| Spread ((max − min) / median) | 23.9% |
| Peak Memory | 236.6 MB |
| Cold-start (run 1) | 3.958 ms (2.8x slower) |
| CV | 2.9% (stable) |
| Gate: inference_time_ms ≤ 1.0 ms | FAILED (1.403 ms) |
| Gate: peak_memory_mb ≤ 150 MB | FAILED (236.6 MB) |
What This Tells Us
Gates work. ResNet50 is a solid model, but at 25.6M parameters it's too heavy for our sub-millisecond latency gate and 150 MB memory cap. The system caught both violations automatically — no human review needed. This is the difference between “ship it, looks fine” and an automated quality gate that blocks the release.
Variability is lower on larger models. ResNet50's CV was only 2.9% (vs 8.3% for MobileNetV2). Heavier workloads tend to saturate the NPU more consistently, reducing the relative impact of scheduling noise. But the cold-start effect is still there: 3.958 ms vs 1.403 ms steady-state.
Five Takeaways from 200 On-Device Runs
1. Never Benchmark with a Single Run
With 83% spread on MobileNetV2, a single-run benchmark is essentially a random sample from a wide distribution. You could report 0.358 ms (best case), 0.665 ms (worst case), or 2.689 ms (cold start) and be equally “correct.” None of these represent actual production performance.
2. Warmup Exclusion Is Non-Negotiable
Both models showed significant cold-start penalties: 7.3x for MobileNetV2 and 2.8x for ResNet50. The first 1–2 runs reflect cache loading, NPU initialization, and memory allocation — not the inference speed your users will experience in production. Exclude them.
3. Median Beats Mean for Gate Decisions
In both models, the mean was higher than the median because occasional spikes pull it up (right-skewed distribution). The median naturally ignores outliers on both ends. For pass/fail gate decisions, this robustness matters: you don't want one thermal throttle event to flip your gate from pass to fail.
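A two-line experiment makes the skew concrete. The values here are synthetic, chosen to mimic a steady-state run with one throttle event:

```python
import statistics

steady = [0.37] * 97                  # steady-state latencies (hypothetical)
with_spike = steady + [2.0]           # one thermal-throttle spike appended

# The spike drags the mean up; the median doesn't move at all.
mean_shift = statistics.fmean(with_spike) - statistics.fmean(steady)
median_shift = statistics.median(with_spike) - statistics.median(steady)
```

One outlier in 98 runs shifts the mean by over 0.016 ms while leaving the median untouched — exactly the robustness you want under a pass/fail threshold.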
4. CV Tells You When to Trust Your Numbers
We flag any metric with a coefficient of variation above 15% as “flaky.” MobileNetV2 at 8.3% CV is stable — you can trust the median. If CV were 20%, you'd want to investigate: thermal issues, background processes, firmware bugs. The CV is your signal-to-noise indicator.
5. Gates Catch Real Issues Automatically
MobileNetV2 passed. ResNet50 failed. No ambiguity, no spreadsheet review, no Slack thread asking “is this number okay?” The gate evaluated the median against the threshold and returned a deterministic verdict. This is the point of automated quality gates.
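The verdict logic amounts to a threshold comparison per metric. `evaluate_gates` below is a hypothetical sketch (not EdgeGate's implementation), fed with the ResNet50 medians from the table above:

```python
def evaluate_gates(metrics, thresholds):
    """Return a deterministic PASSED/FAILED verdict for each gate."""
    return {name: ("PASSED" if metrics[name] <= limit else "FAILED")
            for name, limit in thresholds.items()}

gates = {"inference_time_ms": 1.0, "peak_memory_mb": 150.0}
resnet50 = {"inference_time_ms": 1.403, "peak_memory_mb": 236.6}
verdict = evaluate_gates(resnet50, gates)
```

Because the input is the median (not a single run), the verdict is reproducible: rerunning the benchmark under the same conditions yields the same pass/fail outcome.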
Methodology
Both models were exported from PyTorch's torchvision with ImageNet pretrained weights using torch.onnx.export() at opset 13. Compilation and profiling were done through Qualcomm AI Hub's cloud API targeting the Samsung Galaxy S24 (Family) device profile — Snapdragon 8 Gen 3, SM8650.
Each profile job ran the compiled model 100 times on real hardware, returning per-run inference times in microseconds via the all_inference_times array in the profile response. We excluded the first 2 runs as warmup and computed statistics on the remaining 98 measurements.
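That post-processing step can be sketched as follows. The response dict shape and field path are illustrative stand-ins — only the `all_inference_times` field name and the microsecond units come from the actual profile response:

```python
WARMUP_RUNS = 2

def extract_valid_ms(profile, warmup=WARMUP_RUNS):
    """Convert per-run microsecond timings to ms and drop the warmup runs."""
    times_us = profile["execution_summary"]["all_inference_times"]  # path is illustrative
    return [t / 1000.0 for t in times_us[warmup:]]

# Hypothetical 5-run response (values in microseconds).
profile = {"execution_summary": {"all_inference_times": [2689, 428, 369, 372, 365]}}
valid_ms = extract_valid_ms(profile)
```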
Quality gates used EdgeGate's default thresholds: inference_time_ms ≤ 1.0 and peak_memory_mb ≤ 150. Evidence bundles were generated with Ed25519 signatures and SHA-256 model hashes.
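The SHA-256 half of that evidence chain is easy to reproduce with the standard library (Ed25519 signing requires a third-party package such as PyNaCl, so it is omitted here). `model_sha256` is an illustrative helper, not EdgeGate's code:

```python
import hashlib
import os
import tempfile

def model_sha256(path):
    """Stream a model file through SHA-256, as recorded in an evidence bundle."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
            h.update(chunk)
    return h.hexdigest()

# Demo with a throwaway file standing in for an ONNX model.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"onnx-bytes")
digest = model_sha256(tmp.name)
os.unlink(tmp.name)
```

Recomputing the digest of the exact ONNX artifact and comparing it against the hash in the bundle is how a third party verifies which model binary produced the numbers.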
Evidence & Verification
Every number in this post is backed by a signed evidence bundle. Here are the references:
| Model | Evidence ID | Compile Job | Profile Job |
|---|---|---|---|
| MobileNetV2 | e26730a7 | jgz7zjoop | j5w9y323p |
| ResNet50 | e7ab3b6e | jp18oww7g | jp4wedd1g |
Evidence bundles are Ed25519-signed and include the full raw timing arrays, model SHA-256 hashes, gate evaluations, and device metadata. The signature chain proves the data was generated by EdgeGate and has not been tampered with.
Test Your Models on Real Hardware
EdgeGate runs your models on real Snapdragon devices with median-of-N gating, warmup exclusion, flake detection, and signed evidence bundles. Free tier includes 10 runs/month.
Get Started Free