TL;DR
100 runs of MobileNetV2 on Snapdragon 8 Gen 3 showed 83% latency spread (0.358–0.665 ms), a 7.3x cold-start penalty, and 1.5% mean-vs-median skew from outlier spikes. ResNet50 exceeded both our inference (1.403 ms) and memory (236.6 MB) gates — caught automatically. Every result is Ed25519-signed and SHA-256 hashed.
Experiment Setup
We wanted to answer a simple question: how much does inference latency vary when you run the same model on the same device repeatedly?
To find out, we used two widely known image-classification models at different complexity levels:
| Model | Parameters | ONNX Size | Source |
|---|---|---|---|
| MobileNetV2 | 3.5M | 13.3 MB | PyTorch torchvision (ImageNet pretrained) |
| ResNet50 | 25.6M | 97.4 MB | PyTorch torchvision (ImageNet pretrained) |
Both models were exported to ONNX (opset 13), compiled for Samsung Galaxy S24 (Snapdragon 8 Gen 3 / SM8650) through Qualcomm AI Hub, and profiled 100 times on real hardware. No emulators. No simulators. Real silicon.
MobileNetV2: 100 Runs, 83% Spread
| Metric | Value |
|---|---|
| Total runs | 100 (2 warmup excluded → 98 valid) |
| Median (post-warmup) | 0.369 ms |
| Mean (post-warmup) | 0.375 ms (1.5% higher than median) |
| Std Deviation | 0.031 ms |
| Min | 0.358 ms |
| Max | 0.665 ms (1.8x median — outlier spike) |
| Spread ((max − min) / median) | 83.2% |
| Cold-start (run 1) | 2.689 ms (7.3x slower than median) |
| Coefficient of Variation | 8.3% (below 15% flaky threshold) |
| Gate: inference_time_ms ≤ 1.0 ms | PASSED |
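The statistics in the table above can be reproduced with a few lines of stdlib Python. The `summarize` helper and the timing values below are hypothetical illustrations, not EdgeGate's implementation or the raw run data:

```python
import statistics

def summarize(times_ms, warmup=2):
    """Drop warmup runs, then compute the summary statistics used above."""
    valid = times_ms[warmup:]          # exclude cold-start runs
    med = statistics.median(valid)
    return {
        "median_ms": med,
        "mean_ms": statistics.fmean(valid),
        "cv_pct": 100 * statistics.stdev(valid) / statistics.fmean(valid),
        "spread_pct": 100 * (max(valid) - min(valid)) / med,
    }

# Hypothetical run: one cold start, mostly steady state, one spike.
runs = [2.689, 0.428, 0.365, 0.370, 0.368, 0.372, 0.665, 0.366, 0.371, 0.369]
stats = summarize(runs)
```

Applying the same spread formula, (max − min) / median, to the table's values reproduces the 83.2% figure: (0.665 − 0.358) / 0.369.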
What This Tells Us
The cold-start effect is dramatic. Run 1 came in at 2.689 ms — over 7x slower than the median. Run 2 was 0.428 ms. By run 3, the device had settled into steady state. This is why warmup exclusion isn't optional: without it, your “benchmark” is dominated by cache-miss initialization overhead that doesn't reflect production behavior.
Outlier spikes happen in steady state too. Run 12 spiked to 0.665 ms — 80% above the median. This is likely OS scheduling contention or thermal throttling. If you benchmark with a single run and happen to hit this spike, you'll report a misleading number.
Mean is inflated by spikes. The mean (0.375 ms) is 1.5% higher than the median (0.369 ms). With a small, fast model like MobileNetV2 this gap is modest. With larger models or under thermal stress, the gap can be 5–15%. The median is the robust choice.
ResNet50: Gates Catch a Real Regression
| Metric | Value |
|---|---|
| Total runs | 100 (2 warmup excluded → 98 valid) |
| Median (post-warmup) | 1.403 ms |
| Mean (post-warmup) | 1.413 ms (0.7% higher than median) |
| Std Deviation | 0.041 ms |
| Min / Max | 1.376 ms / 1.711 ms |
| Spread ((max − min) / median) | 23.9% |
| Peak Memory | 236.6 MB |
| Cold-start (run 1) | 3.958 ms (2.8x slower) |
| CV | 2.9% (stable) |
| Gate: inference_time_ms ≤ 1.0 ms | FAILED (1.403 ms) |
| Gate: peak_memory_mb ≤ 150 MB | FAILED (236.6 MB) |
What This Tells Us
Gates work. ResNet50 is a solid model, but at 25.6M parameters it's too heavy for our sub-millisecond latency gate and 150 MB memory cap. The system caught both violations automatically — no human review needed. This is the difference between “ship it, looks fine” and an automated quality gate that blocks the release.
Variability is lower on larger models. ResNet50's CV was only 2.9% (vs 8.3% for MobileNetV2). Heavier workloads tend to saturate the NPU more consistently, reducing the relative impact of scheduling noise. But the cold-start effect is still there: 3.958 ms vs 1.403 ms steady-state.
Five Takeaways from 200 On-Device Runs
1. Never Benchmark with a Single Run
With 83% spread on MobileNetV2, a single-run benchmark is essentially a random sample from a wide distribution. You could report 0.358 ms (best case), 0.665 ms (worst case), or 2.689 ms (cold start) and be equally “correct.” None of these represent actual production performance.
2. Warmup Exclusion Is Non-Negotiable
Both models showed significant cold-start penalties: 7.3x for MobileNetV2 and 2.8x for ResNet50. The first 1–2 runs reflect cache loading, NPU initialization, and memory allocation — not the inference speed your users will experience in production. Exclude them.
3. Median Beats Mean for Gate Decisions
In both models, the mean was higher than the median because occasional spikes pull it up (right-skewed distribution). The median naturally ignores outliers on both ends. For pass/fail gate decisions, this robustness matters: you don't want one thermal throttle event to flip your gate from pass to fail.
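A two-line experiment makes the skew concrete. The values here are synthetic, chosen to mimic a steady-state run with one throttle event:

```python
import statistics

steady = [0.37] * 97                  # steady-state latencies (hypothetical)
with_spike = steady + [2.0]           # one thermal-throttle spike appended

# The spike drags the mean up; the median doesn't move at all.
mean_shift = statistics.fmean(with_spike) - statistics.fmean(steady)
median_shift = statistics.median(with_spike) - statistics.median(steady)
```

One outlier in 98 runs shifts the mean by over 0.016 ms while leaving the median untouched — exactly the robustness you want under a pass/fail threshold.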
4. CV Tells You When to Trust Your Numbers
We flag any metric with a coefficient of variation above 15% as “flaky.” MobileNetV2 at 8.3% CV is stable — you can trust the median. If CV were 20%, you'd want to investigate: thermal issues, background processes, firmware bugs. The CV is your signal-to-noise indicator.
5. Gates Catch Real Issues Automatically
MobileNetV2 passed. ResNet50 failed. No ambiguity, no spreadsheet review, no Slack thread asking “is this number okay?” The gate evaluated the median against the threshold and returned a deterministic verdict. This is the point of automated quality gates.
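The verdict logic amounts to a threshold comparison per metric. `evaluate_gates` below is a hypothetical sketch (not EdgeGate's implementation), fed with the ResNet50 medians from the table above:

```python
def evaluate_gates(metrics, thresholds):
    """Return a deterministic PASSED/FAILED verdict for each gate."""
    return {name: ("PASSED" if metrics[name] <= limit else "FAILED")
            for name, limit in thresholds.items()}

gates = {"inference_time_ms": 1.0, "peak_memory_mb": 150.0}
resnet50 = {"inference_time_ms": 1.403, "peak_memory_mb": 236.6}
verdict = evaluate_gates(resnet50, gates)
```

Because the input is the median (not a single run), the verdict is reproducible: rerunning the benchmark under the same conditions yields the same pass/fail outcome.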
Methodology
Both models were exported from PyTorch's torchvision with ImageNet pretrained weights using torch.onnx.export() at opset 13. Compilation and profiling were done through Qualcomm AI Hub's cloud API targeting the Samsung Galaxy S24 (Family) device profile — Snapdragon 8 Gen 3, SM8650.
Each profile job ran the compiled model 100 times on real hardware, returning per-run inference times in microseconds via the all_inference_times array in the profile response. We excluded the first 2 runs as warmup and computed statistics on the remaining 98 measurements.
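That post-processing step can be sketched as follows. The response dict shape and field path are illustrative stand-ins — only the `all_inference_times` field name and the microsecond units come from the actual profile response:

```python
WARMUP_RUNS = 2

def extract_valid_ms(profile, warmup=WARMUP_RUNS):
    """Convert per-run microsecond timings to ms and drop the warmup runs."""
    times_us = profile["execution_summary"]["all_inference_times"]  # path is illustrative
    return [t / 1000.0 for t in times_us[warmup:]]

# Hypothetical 5-run response (values in microseconds).
profile = {"execution_summary": {"all_inference_times": [2689, 428, 369, 372, 365]}}
valid_ms = extract_valid_ms(profile)
```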
Quality gates used EdgeGate's default thresholds: inference_time_ms ≤ 1.0 and peak_memory_mb ≤ 150. Evidence bundles were generated with Ed25519 signatures and SHA-256 model hashes.
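The SHA-256 half of that evidence chain is easy to reproduce with the standard library (Ed25519 signing requires a third-party package such as PyNaCl, so it is omitted here). `model_sha256` is an illustrative helper, not EdgeGate's code:

```python
import hashlib
import os
import tempfile

def model_sha256(path):
    """Stream a model file through SHA-256, as recorded in an evidence bundle."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
            h.update(chunk)
    return h.hexdigest()

# Demo with a throwaway file standing in for an ONNX model.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"onnx-bytes")
digest = model_sha256(tmp.name)
os.unlink(tmp.name)
```

Recomputing the digest of the exact ONNX artifact and comparing it against the hash in the bundle is how a third party verifies which model binary produced the numbers.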
Evidence & Verification
Every number in this post is backed by a signed evidence bundle. Here are the references:
| Model | Evidence ID | Compile Job | Profile Job |
|---|---|---|---|
| MobileNetV2 | e26730a7 | jgz7zjoop | j5w9y323p |
| ResNet50 | e7ab3b6e | jp18oww7g | jp4wedd1g |
Evidence bundles are Ed25519-signed and include the full raw timing arrays, model SHA-256 hashes, gate evaluations, and device metadata. The signature chain proves the data was generated by EdgeGate and has not been tampered with.
Test Your Models on Real Hardware
EdgeGate runs your models on real Snapdragon devices with median-of-N gating, warmup exclusion, flake detection, and signed evidence bundles. Free tier includes 10 runs/month.
Get Started Free