Speech LLMs are Contextual Reasoning Transcribers
arXiv cs.CL / 4/3/2026
Key Points
- The paper introduces chain-of-thought ASR (CoT-ASR), in which an LLM first analyzes the input speech to generate contextual reasoning and then produces the transcription, aiming to better exploit the LLM's knowledge for speech recognition.
- CoT-ASR performs reasoning and transcription in a single decoding pass, and it supports user-guided transcription by incorporating user-provided context alongside the model's self-generated reasoning (see the prompt sketch after this list).
- To narrow the speech-to-text modality gap, the work proposes a CTC-guided Modality Adapter that uses CTC non-blank token probabilities to align speech encoder outputs with the LLM's textual latent space (a PyTorch sketch follows below).
- Experiments report that CoT-ASR reduces word error rate (WER) by 8.7% and entity error rate (EER) by 16.9% relative to standard LLM-based ASR.
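To make the single-pass format concrete, here is a minimal sketch of how a prompt and output parser could be structured. The tag names (`<reasoning>`, `<transcript>`), instruction wording, and helper functions (`build_cot_asr_prompt`, `parse_transcript`) are hypothetical illustrations; the paper's actual template is not reproduced here.

```python
# Hypothetical single-pass CoT-ASR prompt/parse helpers. Tag names and
# wording are illustrative assumptions, not the paper's actual template.

def build_cot_asr_prompt(user_context=None):
    """Assemble the text instruction that accompanies the speech embeddings."""
    instruction = (
        "Listen to the audio. First, reason about the topic, domain, and "
        "likely named entities inside <reasoning>...</reasoning>. Then "
        "write the final transcript inside <transcript>...</transcript>."
    )
    if user_context:
        # User-guided transcription: user-provided context (e.g., a contact
        # list or meeting agenda) is supplied alongside the instruction.
        instruction = f"User-provided context: {user_context}\n{instruction}"
    return instruction


def parse_transcript(generated):
    """Extract the transcript from one generated sequence that contains
    both the reasoning and the transcription."""
    start = generated.find("<transcript>") + len("<transcript>")
    end = generated.find("</transcript>")
    return generated[start:end].strip()


# Usage: a single generate() call emits reasoning followed by the
# transcript, so the final answer is parsed out of one pass.
output = ("<reasoning>Medical dictation; entity: 'amoxicillin'.</reasoning>"
          "<transcript>Prescribe amoxicillin twice daily.</transcript>")
print(parse_transcript(output))  # -> Prescribe amoxicillin twice daily.
```

Because reasoning and transcription share one decoding pass, no second forward call is needed; the transcript is simply the tail of the generated sequence.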
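For the CTC-guided Modality Adapter, a minimal PyTorch sketch follows, assuming the adapter weights each encoder frame by its CTC non-blank probability before projecting into the LLM's embedding space. The class name, layer sizes, and the exact frame-weighting scheme are assumptions for illustration, not the paper's verified design.

```python
import torch
import torch.nn as nn

class CTCGuidedAdapter(nn.Module):
    """Sketch of a CTC-guided modality adapter. Assumption: frames with
    high CTC non-blank probability carry textual content, so they are
    emphasized before projection into the LLM's latent space."""

    def __init__(self, enc_dim, vocab_size, llm_dim, blank_id=0):
        super().__init__()
        self.ctc_head = nn.Linear(enc_dim, vocab_size)  # frame-level CTC logits
        self.proj = nn.Sequential(                      # speech -> LLM latent space
            nn.Linear(enc_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.blank_id = blank_id

    def forward(self, enc_out):
        # enc_out: (batch, frames, enc_dim) from the speech encoder
        ctc_logits = self.ctc_head(enc_out)
        ctc_probs = ctc_logits.softmax(dim=-1)
        # Probability that each frame carries a non-blank (textual) token.
        non_blank = 1.0 - ctc_probs[..., self.blank_id]   # (batch, frames)
        # Down-weight blank-dominated frames before projection, nudging the
        # adapter output toward the LLM's textual latent space.
        weighted = enc_out * non_blank.unsqueeze(-1)
        return self.proj(weighted), ctc_logits  # logits reusable for a CTC loss

# Usage sketch with made-up dimensions:
adapter = CTCGuidedAdapter(enc_dim=512, vocab_size=5000, llm_dim=4096)
speech_features = torch.randn(2, 100, 512)   # dummy encoder output
llm_inputs, ctc_logits = adapter(speech_features)
```

The design intuition is that CTC blank frames contribute little text information, so suppressing them lets the projection focus on frames that map cleanly onto tokens in the LLM's vocabulary.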