Reducing the Offline-Streaming Gap for Unified ASR Transducer with Consistency Regularization
arXiv cs.CL / 4/22/2026
Key Points
- The paper addresses a key limitation of unified ASR (automatic speech recognition) models: achieving strong accuracy in both offline decoding and low-latency streaming decoding with a single model.
- It proposes a Unified ASR framework for RNNT that enables both offline and streaming decoding through chunk-limited attention (with right context) and dynamic chunked convolutions.
- To reduce the offline–streaming performance gap, the authors introduce mode-consistency regularization for RNNT (MCR-RNNT), implemented efficiently with Triton to encourage agreement across different training modes.
- Experiments indicate improved low-latency streaming accuracy without sacrificing offline performance, and results scale to larger models and bigger training datasets.
- The unified framework and an English RNNT model checkpoint are open-sourced, enabling further adoption and experimentation.
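To make the chunk-limited attention idea concrete, here is a minimal NumPy sketch (not the authors' code; the function name, chunk size, and right-context parameterization are illustrative assumptions): each frame attends to all frames up to the end of its own chunk, plus a fixed number of future right-context frames.

```python
import numpy as np

def chunk_attention_mask(num_frames: int, chunk_size: int, right_context: int = 0) -> np.ndarray:
    """Boolean mask where mask[i, j] is True if frame i may attend to frame j.

    Hypothetical parameterization: frames within a chunk share visibility up to
    the chunk boundary, optionally extended by `right_context` future frames.
    """
    mask = np.zeros((num_frames, num_frames), dtype=bool)
    for i in range(num_frames):
        # End (exclusive) of the chunk containing frame i.
        chunk_end = ((i // chunk_size) + 1) * chunk_size
        visible_end = min(num_frames, chunk_end + right_context)
        mask[i, :visible_end] = True
    return mask
```

With `right_context=0` this gives pure chunked streaming attention; setting `chunk_size` to the full utterance length recovers unrestricted offline attention, which is one way a single encoder can serve both decoding modes.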