TL;DR
Model quantization (FP32 → INT8) typically reduces model size by 60–75% and can improve inference speed, but introduces four categories of regression: accuracy degradation, operator fallback (NPU → CPU), memory layout changes, and numerical instability. Testing on real hardware is essential because quantization behavior varies by chipset, runtime, and operator implementation. EdgeGate automates this by running both FP32 and INT8 variants on real Snapdragon devices and comparing against quality gates.
What Is Model Quantization and Why Does Every Edge AI Team Use It?
Quantization reduces the numerical precision of a model's weights and activations — typically from 32-bit floating point (FP32) to 8-bit integers (INT8). The benefits are significant: smaller model files, lower memory usage, and often faster inference on hardware accelerators designed for integer math.
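The core FP32 → INT8 mapping can be illustrated in a few lines of NumPy. This is a deliberately simplified sketch of per-tensor symmetric quantization; production toolchains (ONNX Runtime, TensorRT, PyTorch) typically use per-channel scales for weights and calibration data for activations, but the underlying arithmetic is the same.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor quantization, a common scheme for weights:
    real_value is approximated by scale * int8_value, with 0.0 mapped
    exactly to integer 0."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate FP32 values from the INT8 representation."""
    return scale * q.astype(np.float32)

weights = np.random.randn(1000).astype(np.float32)
q, scale = quantize_int8(weights)

# Rounding to the nearest of 255 levels bounds the per-weight error
# by half a quantization step.
error = np.abs(dequantize(q, scale) - weights).max()
assert error <= scale

# Per-weight storage drops 4x (4 bytes -> 1 byte); real model files
# shrink somewhat less because metadata and any non-quantized tensors
# stay full size, hence the typical 60-75% reduction.
reduction = 1 - q.nbytes / weights.nbytes  # 0.75
```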
For edge deployment, quantization isn't optional. A model running on a Snapdragon 8 Gen 3 in a mobile or robotics application has hard constraints: limited memory, power budget, and latency requirements that FP32 models often can't meet. INT8 quantization is the standard path to fitting within those constraints.
The catch is that quantization changes the model's behavior in ways that are difficult to predict from cloud-side analysis alone. The same quantization that works perfectly on one chipset can introduce regressions on another, because the underlying NPU operator implementations differ.
What Are the Four Types of Quantization Regression?
When you quantize a model for edge deployment, four categories of regression can emerge. Each manifests differently on real hardware versus cloud simulations.
| Regression Type | Symptom | Why Cloud Testing Misses It |
|---|---|---|
| Accuracy degradation | mAP drops 2–5% after quantization | Cloud quantization tools report acceptable accuracy; real device execution path differs |
| Operator fallback | Latency increases 3–10x | NPU doesn't support a quantized operator variant; falls back to CPU silently |
| Memory layout change | Peak memory spikes despite smaller model | INT8 tensors require different memory alignment; runtime allocates extra buffers |
| Numerical instability | Output variance increases between runs | Rounding errors accumulate differently on NPU vs CPU/GPU paths |
The operator fallback problem is particularly insidious. Your model file is smaller, your cloud benchmarks look good, but on the actual device, a single unsupported INT8 operator forces the entire subgraph to execute on the CPU. Latency jumps from sub-millisecond to tens of milliseconds, and you only discover it when users complain.
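Because the fallback is silent, the most reliable signal is the latency ratio itself. A minimal heuristic check, with an illustrative function name and a 1.5x threshold chosen as an assumption (tune it to your workload), might look like:

```python
def detect_fallback(fp32_latency_ms: float, int8_latency_ms: float,
                    ratio_threshold: float = 1.5) -> bool:
    """Flag probable operator fallback. On an NPU, INT8 should run at
    least as fast as FP32; a ratio well above 1.0 suggests part of the
    graph silently fell back to CPU execution."""
    return int8_latency_ms / fp32_latency_ms > ratio_threshold

# Healthy quantization: latencies are comparable.
assert not detect_fallback(0.1760, 0.1870)

# Suspicious: INT8 several times slower than FP32 points to a CPU path.
assert detect_fallback(0.5, 2.0)
```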
What Does Real-World FP32 vs INT8 Data Look Like?
We tested a person-detection model (MobileNet-style depthwise separable CNN) on real Snapdragon 8 Gen 3 hardware via Qualcomm AI Hub. Both FP32 and INT8 variants were compiled, deployed, and profiled on a Samsung Galaxy S24 family device.
| Metric | FP32 | INT8 | Gate |
|---|---|---|---|
| Inference time | 0.1760 ms | 0.1870 ms | ≤ 1.0 ms ✓ |
| Peak memory | 121.51 MB | 124.66 MB | ≤ 150 MB ✓ |
| Model file size | 1.07 MB | 322 KB | — |
| Parameters | 270,146 | 71,074 | — |
| Evidence ID | dc2e9f67 | 875d3c6f | — |
Both models passed all gates. The INT8 variant achieved a 70.5% reduction in model file size while maintaining comparable inference latency. Notably, the INT8 model used slightly more peak memory (124.66 MB vs 121.51 MB) — a counterintuitive result that illustrates why you can't assume quantization always reduces every metric.
The memory increase happens because INT8 models sometimes require additional dequantization buffers and different tensor layout alignment. This is exactly the kind of regression that cloud-only benchmarking misses.
How Should You Structure Quantization Quality Gates?
Effective quantization testing requires comparing the quantized model against its full-precision baseline on the same physical device under the same conditions. Here's a gate structure we recommend:
| Gate | Threshold | What It Catches |
|---|---|---|
| Absolute latency | INT8 inference ≤ your latency SLA | Hard latency violations regardless of cause |
| Relative latency | INT8 latency ≤ 1.5x FP32 latency | Operator fallback (INT8 should be faster or equal, not slower) |
| Peak memory | INT8 memory ≤ your memory budget | Memory layout regressions, buffer allocation spikes |
| Accuracy delta | INT8 accuracy ≥ FP32 accuracy − 1% | Precision loss beyond acceptable threshold |
| Output variance | Std dev across N runs ≤ 5% of mean | Numerical instability from quantization rounding |
The relative latency gate is critical. If your INT8 model is slower than FP32, something is wrong — most likely operator fallback. This gate catches the problem immediately.
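The five gates above can be expressed as one comparison function. The sketch below is illustrative, not the EdgeGate API: the dict keys, default thresholds, and the use of a scalar output statistic for the variance gate are all assumptions you would adapt to your own harness.

```python
import statistics

def evaluate_gates(fp32: dict, int8: dict,
                   latency_sla_ms: float = 1.0,
                   memory_budget_mb: float = 150.0,
                   max_accuracy_drop: float = 0.01,
                   max_variance_ratio: float = 0.05) -> dict:
    """Evaluate the five quantization gates against FP32 baseline data.

    Each input dict carries: latency_ms (per-run samples), memory_mb,
    accuracy, and outputs (a scalar output statistic per run, used to
    check numerical stability). Returns a per-gate pass/fail map.
    """
    fp32_med = statistics.median(fp32["latency_ms"])
    int8_med = statistics.median(int8["latency_ms"])
    return {
        "absolute_latency": int8_med <= latency_sla_ms,
        "relative_latency": int8_med <= 1.5 * fp32_med,  # fallback check
        "peak_memory": int8["memory_mb"] <= memory_budget_mb,
        "accuracy_delta": int8["accuracy"] >= fp32["accuracy"] - max_accuracy_drop,
        "output_variance": statistics.stdev(int8["outputs"])
                           <= max_variance_ratio * statistics.mean(int8["outputs"]),
    }

# Example using the measurements from the table above (accuracy and
# output values are hypothetical placeholders).
fp32 = {"latency_ms": [0.175, 0.176, 0.177], "memory_mb": 121.51,
        "accuracy": 0.910, "outputs": [1.00, 1.00, 1.00]}
int8 = {"latency_ms": [0.186, 0.187, 0.188], "memory_mb": 124.66,
        "accuracy": 0.905, "outputs": [0.99, 1.00, 1.01]}
result = evaluate_gates(fp32, int8)
```

A per-gate map rather than a single boolean makes CI output actionable: a relative-latency failure points you at operator fallback, while a memory failure points at tensor layout, so the verdict tells you where to look.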
Why Can't You Test Quantization in the Cloud?
Cloud-based quantization tools (ONNX Runtime, TensorRT, PyTorch quantization) give you a quantized model and accuracy metrics. What they can't tell you is how that model behaves on a specific NPU with a specific firmware version.
The Snapdragon 8 Gen 3's Hexagon NPU has a specific set of INT8 operators it accelerates natively. When a quantized model uses an operator outside that set, the runtime silently falls back to CPU execution. This fallback doesn't appear in any cloud benchmark — it only shows up when you profile on the actual device.
Similarly, memory behavior varies between chipsets. The same INT8 model might use 120 MB on a Snapdragon 8 Gen 3 but 180 MB on a Snapdragon 7+ Gen 2 due to different tensor layout requirements. Without testing on each target device, you're guessing.
How Do You Automate Quantization Testing with EdgeGate?
EdgeGate automates the full quantization testing workflow:
- Upload both variants: Push your FP32 and INT8 ONNX models to the same pipeline. EdgeGate compiles both for your target Snapdragon device via Qualcomm AI Hub.
- Run on real hardware: Both models execute on the same physical device, eliminating device-to-device variance from the comparison.
- Gate evaluation: Results are compared against absolute thresholds and relative thresholds (INT8 vs FP32). You get a clear PASS or FAIL for each gate.
- Evidence bundles: Both runs produce Ed25519-signed evidence reports with model SHA-256 hashes, device attestation, and raw performance data, giving you a tamper-evident record of what happened.
- CI integration: Wire this into GitHub Actions so every PR that changes model files or quantization configs triggers a comparison test automatically.
What Are Best Practices for Quantization Testing on Snapdragon?
After running hundreds of quantization comparisons on real Snapdragon hardware, here are the patterns that matter:
- Always test both variants on the same device, same session. Device-to-device and session-to-session variance can mask real regressions.
- Use warmup iterations. The first 2–3 runs on a cold device include NPU initialization overhead. Exclude them from your measurements.
- Set the relative latency gate tight. INT8 should be equal to or faster than FP32. If it's even 20% slower, investigate operator fallback.
- Monitor model file size as a sanity check. INT8 models should be 60–75% smaller. If the reduction is less, some weights may not have quantized correctly.
- Test across your full device matrix. Quantization behavior varies by chipset. A model that passes on Snapdragon 8 Gen 3 may fail on Snapdragon 7+ Gen 2.
- Track trends over time. A single quantization run is a snapshot. Track latency and memory across model versions to catch gradual drift.
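The warmup and median-of-N practices above reduce to a small measurement helper. This is a sketch under stated assumptions: `run_inference` stands in for whatever callable triggers one on-device inference in your harness, and the run counts are illustrative defaults.

```python
import statistics
import time

def measure_latency_ms(run_inference, total_runs: int = 20,
                       warmup: int = 3) -> float:
    """Median-of-N latency in milliseconds, excluding warmup runs.

    The first few iterations on a cold device include NPU
    initialization overhead, so they are timed but discarded; the
    median of the remaining samples resists outlier runs.
    """
    samples = []
    for i in range(total_runs):
        start = time.perf_counter()
        run_inference()
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        if i >= warmup:  # discard cold-start iterations
            samples.append(elapsed_ms)
    return statistics.median(samples)
```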
Test your quantized models on real hardware
EdgeGate runs FP32 and INT8 variants on real Snapdragon devices and gives you a pass/fail verdict in your CI pipeline.
Related Articles
Why Cloud Benchmarks Lie About Edge Performance
Your model hits 12ms in the cloud. On Snapdragon, it takes 47ms. Here's why.
Deterministic Testing for Non-Deterministic Models
How median-of-N gating and flake detection bring statistical rigor to hardware testing.
The Hidden Cost of Edge AI Regressions
Why optimized models break on real Snapdragon hardware and how to prevent it.