Benchmarks · February 25, 2026 · 8 min read

100 Inference Runs on Snapdragon: What the Data Shows

We compiled and profiled two models — MobileNetV2 and ResNet50 — on a Samsung Galaxy S24 (Snapdragon 8 Gen 3) via Qualcomm AI Hub. Each model ran 100 times on real hardware. Here's the raw data, what it means, and why it matters for your quality gates.

EdgeGate Engineering Team

Edge AI CI/CD platform · Qualcomm AI Hub integration partners

TL;DR

100 runs of MobileNetV2 on Snapdragon 8 Gen 3 showed 83% latency spread (0.358–0.665 ms), a 7.3x cold-start penalty, and 1.5% mean-vs-median skew from outlier spikes. ResNet50 exceeded both our inference (1.403 ms) and memory (236.6 MB) gates — caught automatically. Every result is Ed25519-signed and SHA-256 hashed.

Experiment Setup

We wanted to answer a simple question: how much does inference latency vary when you run the same model on the same device repeatedly?

To find out, we used two widely-known image classification models at different complexity levels:

| Model | Parameters | ONNX Size | Source |
| --- | --- | --- | --- |
| MobileNetV2 | 3.5M | 13.3 MB | PyTorch torchvision (ImageNet pretrained) |
| ResNet50 | 25.6M | 97.4 MB | PyTorch torchvision (ImageNet pretrained) |

Both models were exported to ONNX (opset 13), compiled for Samsung Galaxy S24 (Snapdragon 8 Gen 3 / SM8650) through Qualcomm AI Hub, and profiled 100 times on real hardware. No emulators. No simulators. Real silicon.
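The export, compile, and profile steps above can be sketched roughly as follows. This is a hedged sketch, not EdgeGate's pipeline: the `qai_hub` calls mirror Qualcomm AI Hub's Python client as we understand it, and the names `INPUT_SPECS` and `export_and_profile` are our own illustration.

```python
# Sketch of the export → compile → profile pipeline. The qai_hub calls
# follow Qualcomm AI Hub's Python client; function and variable names
# here (export_and_profile, INPUT_SPECS) are illustrative.

# MobileNetV2 takes a single 1x3x224x224 image tensor.
INPUT_SPECS = {"image": (1, 3, 224, 224)}


def export_and_profile(onnx_path: str = "mobilenet_v2.onnx"):
    """Export torchvision MobileNetV2 to ONNX, then profile it on-device."""
    import torch
    import torchvision
    import qai_hub as hub

    # Export with ImageNet-pretrained weights at opset 13.
    model = torchvision.models.mobilenet_v2(weights="IMAGENET1K_V1").eval()
    example = torch.randn(*INPUT_SPECS["image"])
    torch.onnx.export(model, example, onnx_path, opset_version=13)

    # Compile for the Snapdragon 8 Gen 3 device family, then profile
    # the compiled artifact on real hardware.
    device = hub.Device("Samsung Galaxy S24 (Family)")
    compile_job = hub.submit_compile_job(
        model=onnx_path, device=device, input_specs=INPUT_SPECS
    )
    profile_job = hub.submit_profile_job(
        model=compile_job.get_target_model(), device=device
    )
    return profile_job.download_profile()  # dict with per-run timings
```

The same two calls with a different `model` path cover the ResNet50 run; only the input spec and ONNX file change.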

MobileNetV2: 100 Runs, 83% Spread

MobileNetV2 Results: PASSED · Median 0.369 ms · Mean 0.375 ms · Cold start 2.689 ms · Spread 83.2%

| Metric | Value |
| --- | --- |
| Total runs | 100 (2 warmup excluded → 98 valid) |
| Median (post-warmup) | 0.369 ms |
| Mean (post-warmup) | 0.375 ms (1.5% higher than median) |
| Std deviation | 0.031 ms |
| Min | 0.358 ms |
| Max | 0.665 ms (1.8x median, outlier spike) |
| Cold start (run 1) | 2.689 ms (7.3x slower than median) |
| Coefficient of variation | 8.3% (below 15% flaky threshold) |
| Gate: inference_time_ms ≤ 1.0 ms | PASSED |
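Every statistic in the table can be reproduced from the raw timing array with the standard library alone. A minimal sketch, using illustrative timings rather than the actual 100-run dataset:

```python
import statistics


def summarize(times_ms, warmup=2):
    """Summarize per-run latencies, excluding the first `warmup` runs."""
    steady = times_ms[warmup:]
    median = statistics.median(steady)
    mean = statistics.fmean(steady)
    stdev = statistics.stdev(steady)
    return {
        "median_ms": median,
        "mean_ms": mean,
        "stdev_ms": stdev,
        # Spread: (max - min) relative to the steady-state median.
        "spread_pct": 100 * (max(steady) - min(steady)) / median,
        # Cold-start penalty: run 1 vs the steady-state median.
        "cold_start_x": times_ms[0] / median,
        # Coefficient of variation: the signal-to-noise indicator.
        "cv_pct": 100 * stdev / mean,
    }


# Illustrative timings (ms), not the actual benchmark data.
runs = [2.689, 0.428, 0.365, 0.370, 0.372, 0.360, 0.665, 0.368]
stats = summarize(runs)
```

The warmup count is a parameter rather than a constant so that the exclusion policy is visible in the report, not buried in the math.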

What This Tells Us

The cold-start effect is dramatic. Run 1 came in at 2.689 ms — over 7x slower than the median. Run 2 was 0.428 ms. By run 3, the device had settled into steady state. This is why warmup exclusion isn't optional: without it, your “benchmark” is dominated by cache-miss initialization overhead that doesn't reflect production behavior.

Outlier spikes happen in steady state too. Run 12 spiked to 0.665 ms — 80% above the median. This is likely OS scheduling contention or thermal throttling. If you benchmark with a single run and happen to hit this spike, you report the wrong number.

Mean is inflated by spikes. The mean (0.375 ms) is 1.5% higher than the median (0.369 ms). With a small, fast model like MobileNetV2 this gap is modest. With larger models or under thermal stress, the gap can be 5–15%. The median is the robust choice.

ResNet50: Gates Catch a Real Regression

ResNet50 Results: FAILED · Median 1.403 ms · Peak memory 236.6 MB · Cold start 3.958 ms · Spread 23.9%

| Metric | Value |
| --- | --- |
| Total runs | 100 (2 warmup excluded → 98 valid) |
| Median (post-warmup) | 1.403 ms |
| Mean (post-warmup) | 1.413 ms (0.7% higher than median) |
| Std deviation | 0.041 ms |
| Min / Max | 1.376 ms / 1.711 ms |
| Peak memory | 236.6 MB |
| Cold start (run 1) | 3.958 ms (2.8x slower) |
| CV | 2.9% (stable) |
| Gate: inference_time_ms ≤ 1.0 ms | FAILED (1.403 ms) |
| Gate: peak_memory_mb ≤ 150 MB | FAILED (236.6 MB) |

What This Tells Us

Gates work. ResNet50 is a solid model, but at 25.6M parameters it's too heavy for our sub-millisecond latency gate and 150 MB memory cap. The system caught both violations automatically — no human review needed. This is the difference between “ship it, looks fine” and an automated quality gate that blocks the release.
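The gate check itself reduces to threshold comparisons on robust statistics. A minimal sketch of the idea — our illustration, not EdgeGate's actual gate engine:

```python
def evaluate_gates(metrics: dict, gates: dict) -> dict:
    """Compare measured metrics against 'metric <= threshold' gates.

    Returns a per-gate verdict plus an overall pass/fail.
    """
    results = {
        name: metrics[name] <= threshold
        for name, threshold in gates.items()
    }
    results["passed"] = all(results.values())
    return results


# Default thresholds from this post; metrics are ResNet50's measurements.
gates = {"inference_time_ms": 1.0, "peak_memory_mb": 150}
resnet50 = {"inference_time_ms": 1.403, "peak_memory_mb": 236.6}
verdict = evaluate_gates(resnet50, gates)  # both gates fail
```

Feeding the median (not the mean, and never a single run) into `metrics` is what makes the verdict deterministic across re-runs.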

Variability is lower on larger models. ResNet50's CV was only 2.9% (vs 8.3% for MobileNetV2). Heavier workloads tend to saturate the NPU more consistently, reducing the relative impact of scheduling noise. But the cold-start effect is still there: 3.958 ms vs 1.403 ms steady-state.

Five Takeaways from 200 On-Device Runs

1. Never Benchmark with a Single Run

With 83% spread on MobileNetV2, a single-run benchmark is essentially a random sample from a wide distribution. You could report 0.358 ms (best case), 0.665 ms (worst case), or 2.689 ms (cold start) and be equally “correct.” None of these represent actual production performance.

2. Warmup Exclusion Is Non-Negotiable

Both models showed significant cold-start penalties: 7.3x for MobileNetV2 and 2.8x for ResNet50. The first 1–2 runs reflect cache loading, NPU initialization, and memory allocation — not the inference speed your users will experience in production. Exclude them.
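Dropping warmup runs is a one-liner, but making the count explicit and guarded keeps it auditable. A sketch, with the 2-run default mirroring the methodology above:

```python
def exclude_warmup(times_ms: list, n_warmup: int = 2) -> list:
    """Drop the first n_warmup runs: cache loading, NPU init, allocation."""
    if len(times_ms) <= n_warmup:
        raise ValueError("not enough runs to exclude warmup")
    return times_ms[n_warmup:]


steady = exclude_warmup([2.689, 0.428, 0.369, 0.371])  # -> [0.369, 0.371]
```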

3. Median Beats Mean for Gate Decisions

In both models, the mean was higher than the median because occasional spikes pull it up (right-skewed distribution). The median naturally ignores outliers on both ends. For pass/fail gate decisions, this robustness matters: you don't want one thermal throttle event to flip your gate from pass to fail.
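A quick demonstration of why the median is the safer gate input (illustrative numbers, not the benchmark data):

```python
import statistics

# Nine steady runs plus one thermal-throttle spike.
runs = [0.36] * 9 + [1.80]

mean = statistics.fmean(runs)     # pulled up by the single spike
median = statistics.median(runs)  # unaffected by it

# Against a 0.5 ms gate, the mean would fail while the median passes:
# one throttle event should not flip the verdict.
```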

4. CV Tells You When to Trust Your Numbers

We flag any metric with a coefficient of variation above 15% as “flaky.” MobileNetV2 at 8.3% CV is stable — you can trust the median. If CV were 20%, you'd want to investigate: thermal issues, background processes, firmware bugs. The CV is your signal-to-noise indicator.
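The flake check reduces to a single ratio. A sketch, with the 15% threshold quoted above:

```python
import statistics

FLAKY_CV_PCT = 15.0  # above this, don't trust the median


def is_flaky(times_ms: list) -> bool:
    """Flag a metric as flaky when relative variability is too high."""
    cv_pct = 100 * statistics.stdev(times_ms) / statistics.fmean(times_ms)
    return cv_pct > FLAKY_CV_PCT
```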

5. Gates Catch Real Issues Automatically

MobileNetV2 passed. ResNet50 failed. No ambiguity, no spreadsheet review, no Slack thread asking “is this number okay?” The gate evaluated the median against the threshold and returned a deterministic verdict. This is the point of automated quality gates.

Methodology

Both models were exported from PyTorch's torchvision with ImageNet pretrained weights using torch.onnx.export() at opset 13. Compilation and profiling were done through Qualcomm AI Hub's cloud API targeting the Samsung Galaxy S24 (Family) device profile — Snapdragon 8 Gen 3, SM8650.

Each profile job ran the compiled model 100 times on real hardware, returning per-run inference times in microseconds via the all_inference_times array in the profile response. We excluded the first 2 runs as warmup and computed statistics on the remaining 98 measurements.

Quality gates used EdgeGate's default thresholds: inference_time_ms ≤ 1.0 and peak_memory_mb ≤ 150. Evidence bundles were generated with Ed25519 signatures and SHA-256 model hashes.
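The evidence side is standard cryptography: hash the model artifact with SHA-256 and sign the serialized bundle with Ed25519. A minimal sketch of the hashing and canonicalization half using only the stdlib — the Ed25519 step would use a library such as `cryptography` or `pynacl`, and the bundle layout here is our illustration, not EdgeGate's format:

```python
import hashlib
import json


def model_sha256(model_bytes: bytes) -> str:
    """SHA-256 of the model artifact, recorded in the evidence bundle."""
    return hashlib.sha256(model_bytes).hexdigest()


def build_bundle(model_bytes: bytes, stats: dict, verdict: dict) -> bytes:
    """Canonical JSON payload to be Ed25519-signed by the CI service."""
    bundle = {
        "model_sha256": model_sha256(model_bytes),
        "stats": stats,
        "gates": verdict,
    }
    # sort_keys makes the byte stream deterministic, so the signature
    # (and any later verification) is stable across serializers.
    return json.dumps(bundle, sort_keys=True).encode()
```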

Evidence & Verification

Every number in this post is backed by a signed evidence bundle. Here are the references:

| Model | Evidence ID | Compile Job | Profile Job |
| --- | --- | --- | --- |
| MobileNetV2 | e26730a7 | jgz7zjoop | j5w9y323p |
| ResNet50 | e7ab3b6e | jp18oww7g | jp4wedd1g |

Evidence bundles are Ed25519-signed and include the full raw timing arrays, model SHA-256 hashes, gate evaluations, and device metadata. The signature chain proves the data was generated by EdgeGate and has not been tampered with.

Test Your Models on Real Hardware

EdgeGate runs your models on real Snapdragon devices with median-of-N gating, warmup exclusion, flake detection, and signed evidence bundles. Free tier includes 10 runs/month.

Get Started Free →