Speech LLMs are Contextual Reasoning Transcribers

arXiv cs.CL / 4/3/2026


Key Points

  • The paper introduces chain-of-thought ASR (CoT-ASR), which has an LLM analyze input speech to generate contextual reasoning before producing transcription, aiming to better leverage LLM knowledge in speech recognition.
  • CoT-ASR performs both reasoning and transcription in a single pass, and it supports user-guided transcription by incorporating user-provided context alongside the model’s self-generated reasoning.
  • To reduce the speech-to-text modality gap, the work proposes a CTC-guided Modality Adapter that uses CTC non-blank token probabilities to align speech encoder outputs with the LLM’s textual latent space.
  • Experiments report that CoT-ASR reduces word error rate (WER) by 8.7% and entity error rate (EER) by 16.9% relative to standard LLM-based ASR.

Abstract

Despite extensions to speech inputs, effectively leveraging the rich knowledge and contextual understanding of large language models (LLMs) in automatic speech recognition (ASR) remains non-trivial, as the task primarily involves direct speech-to-text mapping. To address this gap, the paper proposes chain-of-thought ASR (CoT-ASR), which constructs a reasoning chain that enables LLMs to first analyze the input speech and generate contextual analysis, thereby fully exploiting their generative capabilities. With this contextual reasoning, CoT-ASR then performs more informed speech recognition and completes both reasoning and transcription in a single pass. Moreover, CoT-ASR naturally supports user-guided transcription: while designed to self-generate reasoning, it can also seamlessly incorporate user-provided context to guide transcription, further extending ASR functionality. To reduce the modality gap, the paper introduces a CTC-guided Modality Adapter, which uses CTC non-blank token probabilities to weight LLM embeddings, efficiently aligning speech encoder outputs with the LLM's textual latent space. Experiments show that, compared to standard LLM-based ASR, CoT-ASR achieves a relative reduction of 8.7% in word error rate (WER) and 16.9% in entity error rate (EER).
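To make the adapter idea concrete, the following is a minimal sketch of one plausible reading of "using CTC non-blank token probabilities to weight LLM embeddings": each speech frame's CTC posterior is used to take an expectation over the LLM's token embedding table, and frames the CTC head deems blank are down-weighted. The function name, argument shapes, and this exact weighting scheme are illustrative assumptions, not the paper's verified formulation.

```python
# Hedged sketch of a CTC-guided modality adapter. Assumptions (not from the
# paper): posteriors are per-frame softmax outputs over the LLM vocabulary
# plus a blank symbol, and frames are weighted by 1 - P(blank).
import numpy as np

def ctc_guided_adapt(ctc_posteriors, llm_embeddings, blank_id=0):
    """Map speech-frame CTC posteriors into the LLM's embedding space.

    ctc_posteriors: (T, V) per-frame probabilities over V symbols,
                    where index `blank_id` is the CTC blank.
    llm_embeddings: (V, D) the LLM's input embedding table.
    Returns: (T, D) soft text embeddings, scaled by non-blank probability.
    """
    posteriors = ctc_posteriors.copy()
    non_blank = 1.0 - posteriors[:, blank_id:blank_id + 1]  # (T, 1)
    posteriors[:, blank_id] = 0.0
    # Renormalize over non-blank symbols; guard against all-blank frames.
    denom = posteriors.sum(axis=1, keepdims=True)
    denom = np.where(denom > 0.0, denom, 1.0)
    # Expected token embedding under the (non-blank) CTC posterior.
    soft_emb = (posteriors / denom) @ llm_embeddings  # (T, D)
    # Down-weight frames the CTC head believes are blank.
    return non_blank * soft_emb
```

Under this reading, blank-dominated frames contribute little to the LLM input, so the adapter effectively compresses the acoustic sequence toward text-like token positions before it reaches the LLM.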