Reducing the Offline-Streaming Gap for Unified ASR Transducer with Consistency Regularization

arXiv cs.CL / 4/22/2026


Key Points

  • The paper addresses a key limitation of unified ASR (automatic speech recognition) models: achieving strong performance in both offline decoding and low-latency streaming decoding with a single model.
  • It proposes a Unified ASR framework for RNNT that enables both offline and streaming decoding through chunk-limited attention (with right context) and dynamic chunked convolutions.
  • To reduce the offline–streaming performance gap, the authors introduce mode-consistency regularization for RNNT (MCR-RNNT), implemented efficiently with Triton to encourage agreement across different training modes.
  • Experiments indicate improved low-latency streaming accuracy without sacrificing offline performance, and results scale to larger models and bigger training datasets.
  • The unified framework and an English RNNT model checkpoint are open-sourced, enabling further adoption and experimentation.
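The chunk-limited attention mentioned above restricts each frame to a fixed window of past chunks plus a small amount of right (future) context, which is what makes one encoder usable for both streaming and offline decoding. The paper does not spell out its exact masking scheme, so the following is only a minimal sketch of one common way to build such a mask; the chunk size and right-context amounts are illustrative.

```python
import numpy as np

def chunk_limited_mask(num_frames: int, chunk_size: int,
                       right_context_chunks: int = 0) -> np.ndarray:
    """Boolean attention mask: mask[i, j] is True if frame i may attend to frame j.

    Each frame attends to all frames up to the end of its own chunk, plus
    `right_context_chunks` additional chunks of future (right) context.
    This is a generic illustration, not the paper's exact formulation.
    """
    mask = np.zeros((num_frames, num_frames), dtype=bool)
    for i in range(num_frames):
        chunk_idx = i // chunk_size
        # Frames visible up to the end of (own chunk + right-context chunks).
        visible_end = min(num_frames,
                          (chunk_idx + 1 + right_context_chunks) * chunk_size)
        mask[i, :visible_end] = True
    return mask
```

With `right_context_chunks=0` the mask corresponds to a low-latency streaming mode; increasing it (or making the window cover the whole utterance) moves the same model toward offline behavior.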

Abstract

Unification of automatic speech recognition (ASR) systems reduces development and maintenance costs, but training a single model to perform well in both offline and low-latency streaming settings remains challenging. We present a Unified ASR framework for Transducer (RNNT) training that supports both offline and streaming decoding within a single model, using chunk-limited attention with right context and dynamic chunked convolutions. To further close the gap between offline and streaming performance, we introduce an efficient Triton implementation of mode-consistency regularization for RNNT (MCR-RNNT), which encourages agreement across training modes. Experiments show that the proposed approach improves streaming accuracy at low latency while preserving offline performance and scaling to larger model sizes and training datasets. The proposed Unified ASR framework and the English model checkpoint are open-sourced.
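The paper's MCR-RNNT is an efficient Triton implementation of consistency regularization over transducer outputs; its exact loss is not given in this summary. As a toy illustration of the general idea, assuming agreement is encouraged between the token posteriors produced under the offline and streaming training modes, one could use a symmetric KL term like the following NumPy sketch:

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mode_consistency_loss(logits_offline: np.ndarray,
                          logits_streaming: np.ndarray,
                          eps: float = 1e-12) -> float:
    """Symmetric KL divergence between offline-mode and streaming-mode
    token posteriors, averaged over positions.

    A generic consistency-regularization term for illustration only;
    the paper's Triton MCR-RNNT formulation may differ.
    """
    p = softmax(logits_offline)
    q = softmax(logits_streaming)
    kl_pq = (p * (np.log(p + eps) - np.log(q + eps))).sum(-1)
    kl_qp = (q * (np.log(q + eps) - np.log(p + eps))).sum(-1)
    return float(0.5 * (kl_pq + kl_qp).mean())
```

Added to the standard RNNT loss for each mode, such a term penalizes the two decoding modes for disagreeing on the same input, which is one plausible mechanism behind the reported narrowing of the offline-streaming gap.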