Text-Utilization for Encoder-dominated Speech Recognition Models

arXiv cs.AI / 4/30/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper studies how to leverage text-only data to improve speech recognition, particularly for encoder-dominated models that support faster decoding.
  • It compares multiple approaches for integrating text-only data, including modality matching and dynamic downsampling to align text representations within the encoder.
  • Experiments on the LibriSpeech dataset indicate that using a larger encoder with a smaller decoder can match or outperform systems that rely on larger decoders.
  • The authors find that simpler setups—such as random duration models—can be more effective than more complex alternatives, reducing training pipeline complexity.
  • The research provides publicly available code and training “recipes” for reproducibility and practical adoption.
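The "random duration model" mentioned above can be pictured as follows: each text-token embedding is repeated a random number of times so that the text sequence roughly matches the frame rate of speech features before entering the encoder. This is a minimal sketch under assumed details; the function name, duration range, and embedding format are illustrative, not taken from the paper's code.

```python
import random

def upsample_with_random_durations(token_embeddings, min_dur=2, max_dur=8, seed=None):
    """Repeat each text-token embedding a random number of times so the
    upsampled text sequence mimics the length of a speech-frame sequence.
    Names and the (min_dur, max_dur) range are illustrative assumptions."""
    rng = random.Random(seed)
    frames = []
    for emb in token_embeddings:
        duration = rng.randint(min_dur, max_dur)  # sampled, not predicted
        frames.extend([emb] * duration)
    return frames

# Toy usage: three "token embeddings" become a longer pseudo-speech sequence.
tokens = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
pseudo_speech = upsample_with_random_durations(tokens, seed=0)
```

Because durations are sampled rather than predicted, no auxiliary duration predictor has to be trained, which is one way such a setup can simplify the training pipeline.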

Abstract

This paper investigates efficient methods for utilizing text-only data to improve speech recognition, focusing on encoder-dominated models that facilitate faster recognition. We provide a comprehensive comparison of techniques to integrate text-only data, including modality matching and dynamic downsampling to reach text-level representations within the encoder. Our experiments on the LibriSpeech corpus show that a larger encoder with a smaller decoder can equal or surpass the performance of architectures with larger decoders. We demonstrate that simple configurations, such as random duration models, are often more effective than complex alternatives, significantly simplifying the training pipeline. All code and recipes are made publicly available.
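One common realization of dynamic downsampling toward text-level representations is CIF-style (continuous integrate-and-fire) pooling: a per-frame weight is accumulated, and a weighted average of the buffered frames is emitted each time the accumulator crosses a threshold, compressing frame-level features to roughly token-level length. The sketch below illustrates that general mechanism; it is an assumption for exposition, and the paper's exact downsampling method may differ.

```python
def cif_downsample(frames, weights, threshold=1.0):
    """CIF-style dynamic downsampling: accumulate per-frame weights and emit
    a weighted average of buffered frames whenever the accumulator reaches
    `threshold`. Illustrative sketch only, not the paper's implementation."""
    outputs, acc = [], 0.0
    buf = [0.0] * len(frames[0])
    for frame, w in zip(frames, weights):
        remaining = threshold - acc
        if w < remaining:
            acc += w
            buf = [b + w * f for b, f in zip(buf, frame)]
        else:
            # Fire: spend `remaining` of this frame's weight to close the
            # current output vector, carry the leftover into the next one.
            outputs.append([b + remaining * f for b, f in zip(buf, frame)])
            leftover = w - remaining
            acc = leftover
            buf = [leftover * f for f in frame]
    return outputs

# Four frames with total weight 2.0 collapse to two output vectors.
out = cif_downsample([[1.0], [2.0], [3.0], [4.0]], [0.5, 0.5, 0.5, 0.5])
# → [[1.5], [3.5]]
```

The output length is governed by the weights rather than a fixed stride, which is what makes the downsampling "dynamic": the encoder can reach a text-like sequence length regardless of how fast the speaker talks.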