Pushing the Limits of On-Device Streaming ASR: A Compact, High-Accuracy English Model for Low-Latency Inference

arXiv cs.AI / April 17, 2026


Key Points

  • The paper investigates how to deploy high-accuracy automatic speech recognition (ASR) entirely on CPU for edge devices by balancing accuracy, latency, and memory footprint.
  • It benchmarks 50+ streaming-capable configurations across multiple ASR paradigms (encoder-decoder, transducer, and LLM-based), concluding that NVIDIA’s Nemotron Speech Streaming is the strongest fit for real-time English streaming on constrained hardware.
  • The authors rebuild the full streaming inference pipeline in ONNX Runtime and apply several post-training quantization methods plus graph-level operator fusion to reduce model size significantly.
  • Quantization and fusion shrink the model from 2.47 GB down to as little as 0.67 GB while keeping word error rate (WER) within 1% absolute of the full-precision PyTorch baseline.
  • The recommended int4 k-quant configuration delivers 8.20% average streaming WER across eight benchmarks with 0.56s algorithmic latency on CPU, achieving a new quality-efficiency trade-off point for on-device streaming ASR.
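The paper does not spell out its quantization recipes beyond naming them, but one of the listed methods, round-to-nearest (RTN) quantization to int4, can be sketched in a few lines. The block size, symmetric scaling, and fp32 scales below are illustrative assumptions; the importance-weighted k-quant and mixed-precision schemes the authors actually recommend choose scales differently.

```python
import numpy as np

def quantize_rtn_int4(weights: np.ndarray, block_size: int = 32):
    """Round-to-nearest symmetric int4 quantization, one scale per block.

    Illustrative sketch only -- block size and scale format are assumptions,
    not the paper's exact configuration.
    """
    flat = weights.reshape(-1, block_size)
    # Per-block scale maps the largest magnitude onto the int4 range [-8, 7].
    scales = np.abs(flat).max(axis=1, keepdims=True) / 7.0
    scales = np.where(scales == 0, 1.0, scales)  # avoid divide-by-zero
    q = np.clip(np.round(flat / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scales

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
q, s = quantize_rtn_int4(w)
w_hat = dequantize(q, s).reshape(w.shape)
# 4-bit codes plus one fp32 scale per 32 weights is ~5 bits/weight vs 32,
# roughly the ~4x shrink reported above (2.47 GB -> 0.67 GB).
mean_err = np.abs(w - w_hat).mean()
```

The per-block scale is what keeps WER close to the fp32 baseline: outlier weights only inflate the scale of their own block instead of the whole tensor.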

Abstract

Deploying high-quality automatic speech recognition (ASR) on edge devices requires models that jointly optimize accuracy, latency, and memory footprint while operating entirely on CPU without GPU acceleration. We conduct a systematic empirical study of state-of-the-art ASR architectures, encompassing encoder-decoder, transducer, and LLM-based paradigms, evaluated across batch, chunked, and streaming inference modes. Through a comprehensive benchmark of over 50 configurations spanning OpenAI Whisper, NVIDIA Nemotron, Parakeet TDT, Canary, Conformer Transducer, and Qwen3-ASR, we identify NVIDIA's Nemotron Speech Streaming as the strongest candidate for real-time English streaming on resource-constrained hardware. We then re-implement the complete streaming inference pipeline in ONNX Runtime and conduct a controlled evaluation of multiple post-training quantization strategies, including importance-weighted k-quant, mixed-precision schemes, and round-to-nearest quantization, combined with graph-level operator fusion. These optimizations reduce the model from 2.47 GB to as little as 0.67 GB while maintaining word error rate (WER) within 1% absolute of the full-precision PyTorch baseline. Our recommended configuration, the int4 k-quant variant, achieves 8.20% average streaming WER across eight standard benchmarks, running comfortably faster than real-time on CPU with 0.56 s algorithmic latency, establishing a new quality-efficiency Pareto point for on-device streaming ASR.
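The 0.56 s algorithmic latency corresponds to the audio chunk the streaming encoder consumes before it can emit text. A toy driver loop for that chunked interface is sketched below; the `_decode_chunk` stub and the single opaque `_state` object are placeholders, since Nemotron's actual cache-aware ONNX Runtime session takes and returns explicit cache tensors.

```python
import numpy as np

SAMPLE_RATE = 16_000
CHUNK_SECONDS = 0.56                               # algorithmic latency per the paper
CHUNK_SAMPLES = int(SAMPLE_RATE * CHUNK_SECONDS)   # 8960 samples

class StreamingASR:
    """Toy chunked streaming driver.

    A real pipeline would run a quantized ONNX Runtime session here,
    threading encoder cache tensors between chunks; this sketch only
    shows the buffering contract of chunk-wise streaming inference.
    """

    def __init__(self):
        self._buffer = np.empty(0, dtype=np.float32)
        self._state = None  # stand-in for the encoder's streaming cache

    def _decode_chunk(self, chunk: np.ndarray, state):
        # Hypothetical stub: a real system decodes text from the chunk.
        return f"<{len(chunk)} samples>", state

    def push_audio(self, samples: np.ndarray) -> list[str]:
        """Feed arbitrary-length audio; emit one hypothesis per full chunk."""
        self._buffer = np.concatenate([self._buffer, samples])
        out = []
        while len(self._buffer) >= CHUNK_SAMPLES:
            chunk = self._buffer[:CHUNK_SAMPLES]
            self._buffer = self._buffer[CHUNK_SAMPLES:]
            text, self._state = self._decode_chunk(chunk, self._state)
            out.append(text)
        return out

asr = StreamingASR()
hyps = asr.push_audio(np.zeros(SAMPLE_RATE, dtype=np.float32))  # 1 s of audio
```

One second of audio completes exactly one 0.56 s chunk, with the remainder buffered until the next call; that buffering, not model compute, is what sets the latency floor the abstract reports.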