Diffusion Language Models for Speech Recognition

arXiv cs.CL / 4/16/2026


Key Points

  • The paper investigates how diffusion language models (including masked diffusion language models and uniform-state diffusion models) can be adapted to improve speech recognition via ASR hypothesis rescoring.
  • It presents practical guidance for incorporating MDLMs and USDMs into the rescoring pipeline and compares how effectively each improves recognized-text accuracy.
  • A new joint-decoding approach is proposed that fuses CTC-derived framewise probability distributions with USDM-derived labelwise probability distributions at each decoding step to generate better candidate transcriptions.
  • The results indicate that both USDM and MDLM can significantly improve transcription accuracy compared with standard approaches, and the authors release code and recipes for reproducibility.
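The rescoring described in the points above is typically a log-linear interpolation of the acoustic score with an LM score; for a masked diffusion LM, the LM score can be a pseudo-log-likelihood obtained by masking and scoring each position. A minimal sketch, in which the masked-LM scorer is a hypothetical stand-in (a toy unigram table), not the paper's model:

```python
def masked_lm_pseudo_ll(tokens, score_token):
    """Pseudo-log-likelihood: sum of per-position scores. For a real MDLM,
    score_token(tokens, i) would mask position i and return the model's
    log-probability of the original token; here it is any stand-in scorer."""
    return sum(score_token(tokens, i) for i in range(len(tokens)))

def rescore_nbest(nbest, score_token, lm_weight=0.5):
    """Re-rank (tokens, acoustic_logprob) hypotheses by a log-linear
    combination of the acoustic score and the LM pseudo-log-likelihood."""
    scored = [(ac + lm_weight * masked_lm_pseudo_ll(toks, score_token), toks)
              for toks, ac in nbest]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [toks for _, toks in scored]

# Toy stand-in scorer: a unigram log-prob table (hypothetical, illustration only).
UNIGRAM = {"the": -1.0, "cat": -2.0, "sat": -2.0, "kat": -8.0}
def toy_scorer(tokens, i):
    return UNIGRAM.get(tokens[i], -10.0)

nbest = [(["the", "kat", "sat"], -3.0),   # acoustically best, poor language
         (["the", "cat", "sat"], -3.5)]   # acoustically worse, good language
reranked = rescore_nbest(nbest, toy_scorer)
```

With these toy numbers the linguistically better hypothesis overtakes the acoustically better one after rescoring, which is the behavior LM rescoring is meant to produce.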

Abstract

Diffusion language models have recently emerged as a leading alternative to standard language models, owing to their support for bidirectional attention and parallel text generation. In this work, we explore variants of these models for use in speech recognition. Specifically, we introduce a comprehensive guide to incorporating masked diffusion language models (MDLMs) and uniform-state diffusion models (USDMs) for rescoring ASR hypotheses. Additionally, we design a new joint-decoding method that combines CTC and USDM by integrating the framewise probability distributions derived from CTC with the labelwise probability distributions computed by the USDM at each decoding step, thereby generating new candidates that combine the strong language knowledge of the USDM with the acoustic information from CTC. Our findings reveal that USDMs, as well as MDLMs, can significantly improve the accuracy of recognized text. We publish all our code and recipes.
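The per-step integration described in the abstract can be sketched as log-linear fusion of two label distributions. This is a simplified illustration under the assumption that a CTC-derived label distribution and an LM label distribution are already available at each step (in the paper these come from framewise CTC posteriors and the USDM; here they are hand-made toy distributions), not the authors' exact algorithm:

```python
import math

def fuse_step(ctc_logprobs, lm_logprobs, ctc_weight=0.7):
    """Log-linearly fuse a CTC-derived label distribution with a
    language-model label distribution at one decoding step and return
    the argmax label. Labels unseen by the LM get -inf."""
    fused = {label: ctc_weight * ctc_lp
                    + (1.0 - ctc_weight) * lm_logprobs.get(label, float("-inf"))
             for label, ctc_lp in ctc_logprobs.items()}
    return max(fused, key=fused.get)

# Toy per-step distributions (hypothetical, for illustration).
ctc = {"a": math.log(0.6), "b": math.log(0.4)}   # acoustics slightly prefer "a"
lm  = {"a": math.log(0.1), "b": math.log(0.9)}   # language strongly prefers "b"
```

With a balanced weight the fused decision follows the confident LM ("b"); setting `ctc_weight=1.0` recovers the pure acoustic choice ("a"), which shows how the weight trades acoustic evidence against language knowledge.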