Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs
arXiv cs.CL / 3/30/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The paper asks whether integrating speech as a native modality in LLMs (SpeechLLMs) improves speech-to-text translation quality over traditional cascaded pipelines, which chain a speech foundation model with a downstream text model (see the sketch after this list).
- It introduces “Hearing to Translate,” the first comprehensive benchmark suite of its kind, evaluating 6 state-of-the-art SpeechLLMs against 16 strong direct and cascaded baselines across 16 benchmarks, 13 language pairs, and 9 challenging conditions (e.g., disfluent speech, background noise, and long-form audio).
- Overall results show that cascaded systems remain the most reliable approach, though the newest SpeechLLMs can match or outperform cascades in multiple settings.
- The analysis also finds that speech foundation models (SFMs) alone lag behind both cascades and end-to-end LLM integration, suggesting that strong LLM integration is crucial for high-quality translation.
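To make the comparison concrete, here is a minimal sketch of the two architectures the paper contrasts, written against Hugging Face `transformers` pipelines. The specific model checkpoints and the SpeechLLM interface are illustrative assumptions, not the paper's exact systems.

```python
# Sketch of cascaded vs. end-to-end speech translation.
# Model choices are assumptions for illustration only.
from transformers import pipeline

# --- Cascaded pipeline: speech foundation model (ASR) + text MT model ---
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
mt = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")

def cascade_translate(audio_path: str) -> str:
    """Transcribe first, then translate the transcript.

    ASR errors propagate into the translation step, which is the
    classic weakness of cascades that SpeechLLMs aim to avoid.
    """
    transcript = asr(audio_path)["text"]
    return mt(transcript)[0]["translation_text"]

# --- End-to-end SpeechLLM: one model maps audio directly to target text ---
# `speech_llm_translate` is a hypothetical stand-in: real SpeechLLMs take
# audio plus an instruction prompt and decode the translation in one pass,
# with no intermediate transcript.
def speech_llm_translate(audio_path: str, target_lang: str = "German") -> str:
    raise NotImplementedError("placeholder for a native speech-modality LLM")
```

The design difference is what the benchmark probes: the cascade exposes an intermediate transcript (useful for debugging, brittle under noise and disfluency), while the end-to-end model trades that transparency for a single jointly trained mapping from audio to translated text.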