Reducing the Offline-Streaming Gap for Unified ASR Transducer with Consistency Regularization

arXiv cs.CL / 4/22/2026


Key Points

  • The paper addresses a key limitation of unified ASR (automatic speech recognition) models: achieving strong performance in both offline decoding and low-latency streaming decoding with a single model.
  • It proposes a Unified ASR framework for RNNT that enables both offline and streaming decoding through chunk-limited attention (with right context) and dynamic chunked convolutions.
  • To reduce the offline–streaming performance gap, the authors introduce mode-consistency regularization for RNNT (MCR-RNNT), implemented efficiently with Triton to encourage agreement across different training modes.
  • Experiments indicate improved low-latency streaming accuracy without sacrificing offline performance, and results scale to larger models and bigger training datasets.
  • The unified framework and an English RNNT model checkpoint are open-sourced, enabling further adoption and experimentation.
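The chunk-limited attention mentioned above restricts each frame to a fixed window of past chunks plus a small amount of right (future) context, which is what makes one encoder usable for both streaming and offline decoding. The paper does not spell out its exact masking scheme, so the following is only a minimal sketch of one common way to build such a mask; the chunk size and right-context amounts are illustrative.

```python
import numpy as np

def chunk_limited_mask(num_frames: int, chunk_size: int,
                       right_context_chunks: int = 0) -> np.ndarray:
    """Boolean attention mask: mask[i, j] is True if frame i may attend to frame j.

    Each frame attends to all frames up to the end of its own chunk, plus
    `right_context_chunks` additional chunks of future (right) context.
    This is a generic illustration, not the paper's exact formulation.
    """
    mask = np.zeros((num_frames, num_frames), dtype=bool)
    for i in range(num_frames):
        chunk_idx = i // chunk_size
        # Frames visible up to the end of (own chunk + right-context chunks).
        visible_end = min(num_frames,
                          (chunk_idx + 1 + right_context_chunks) * chunk_size)
        mask[i, :visible_end] = True
    return mask
```

With `right_context_chunks=0` the mask corresponds to a low-latency streaming mode; increasing it (or making the window cover the whole utterance) moves the same model toward offline behavior.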

Abstract

Unification of automatic speech recognition (ASR) systems reduces development and maintenance costs, but training a single model to perform well in both offline and low-latency streaming settings remains challenging. We present a Unified ASR framework for Transducer (RNNT) training that supports both offline and streaming decoding within a single model, using chunk-limited attention with right context and dynamic chunked convolutions. To further close the gap between offline and streaming performance, we introduce an efficient Triton implementation of mode-consistency regularization for RNNT (MCR-RNNT), which encourages agreement across training modes. Experiments show that the proposed approach improves streaming accuracy at low latency while preserving offline performance and scaling to larger model sizes and training datasets. The proposed Unified ASR framework and the English model checkpoint are open-sourced.
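The paper's MCR-RNNT is an efficient Triton implementation of consistency regularization over transducer outputs; its exact loss is not given in this summary. As a toy illustration of the general idea, assuming agreement is encouraged between the token posteriors produced under the offline and streaming training modes, one could use a symmetric KL term like the following NumPy sketch:

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mode_consistency_loss(logits_offline: np.ndarray,
                          logits_streaming: np.ndarray,
                          eps: float = 1e-12) -> float:
    """Symmetric KL divergence between offline-mode and streaming-mode
    token posteriors, averaged over positions.

    A generic consistency-regularization term for illustration only;
    the paper's Triton MCR-RNNT formulation may differ.
    """
    p = softmax(logits_offline)
    q = softmax(logits_streaming)
    kl_pq = (p * (np.log(p + eps) - np.log(q + eps))).sum(-1)
    kl_qp = (q * (np.log(q + eps) - np.log(p + eps))).sum(-1)
    return float(0.5 * (kl_pq + kl_qp).mean())
```

Added to the standard RNNT loss for each mode, such a term penalizes the two decoding modes for disagreeing on the same input, which is one plausible mechanism behind the reported narrowing of the offline-streaming gap.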