Pushing the Limits of On-Device Streaming ASR: A Compact, High-Accuracy English Model for Low-Latency Inference
arXiv cs.AI / 4/17/2026
📰 News · Developer Stack & Infrastructure · Signals & Early Trends · Models & Research
Key Points
- The paper investigates how to deploy high-accuracy automatic speech recognition (ASR) entirely on CPU for edge devices by balancing accuracy, latency, and memory footprint.
- It benchmarks 50+ streaming-capable configurations across multiple ASR paradigms (encoder-decoder, transducer, and LLM-based), concluding that NVIDIA’s Nemotron Speech Streaming is the strongest fit for real-time English streaming on constrained hardware.
- The authors rebuild the full streaming inference pipeline in ONNX Runtime and apply several post-training quantization schemes plus graph-level operator fusion to cut model size and memory footprint.
- Quantization and fusion shrink the model from 2.47 GB down to as little as 0.67 GB while keeping word error rate (WER) within 1% absolute of the full-precision PyTorch baseline.
- The recommended int4 k-quant configuration delivers 8.20% average streaming WER across eight benchmarks with 0.56s algorithmic latency on CPU, achieving a new quality-efficiency trade-off point for on-device streaming ASR.
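The size reduction above comes from post-training quantization: storing weights on a small integer grid and dequantizing at inference. As a minimal, generic illustration (symmetric per-tensor int8, not the paper's int4 k-quant pipeline, and with hand-picked example weights), the trade-off looks like this:

```python
# Sketch of post-training weight quantization: map fp32 weights onto an
# int8 grid (4x smaller storage), then dequantize. Illustrative only --
# the paper uses int4 k-quant inside ONNX Runtime, not this scheme.

def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization: w ~= scale * q."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [scale * v for v in q]

weights = [0.31, -1.27, 0.05, 0.98, -0.44]  # toy example values
q, scale = quantize_int8(weights)
recovered = dequantize(q, scale)

# Per-weight reconstruction error is bounded by half a quantization
# step (scale / 2), which is why accuracy degrades only slightly.
max_err = max(abs(w - r) for w, r in zip(weights, recovered))
assert max_err <= scale / 2 + 1e-9
```

The same idea at int4 halves storage again (hence 2.47 GB shrinking toward 0.67 GB) but widens the quantization step, which is why the paper reports a small WER gap against the fp32 baseline rather than none.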