Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs

arXiv cs.CL / 3/30/2026



Key Points

The paper studies whether adding speech as a native modality in LLMs (SpeechLLMs) improves speech-to-text translation quality compared with traditional cascaded pipelines that use speech foundation models plus downstream text models.
It introduces “Hearing to Translate,” the first comprehensive benchmark suite that evaluates 6 state-of-the-art SpeechLLMs against 16 strong direct and cascade baselines across 16 benchmarks, 13 language pairs, and 9 difficult conditions (e.g., disfluency, noise, and long-form audio).
Overall results show cascaded systems are still the most reliable approach, but the newest SpeechLLMs can match or outperform cascades in multiple settings.
The analysis also finds that speech foundation models (SFMs) used on their own lag behind both cascades and SpeechLLMs, suggesting that integrating an LLM, whether inside the model or as a pipeline stage, is essential for high-quality translation.

Abstract

As Large Language Models (LLMs) expand beyond text, integrating speech as a native modality has given rise to SpeechLLMs, which directly process spoken language and enable speech-to-text translation (ST) and other downstream tasks, bypassing traditional transcription-based pipelines. Whether this integration improves ST quality over established cascaded architectures, however, remains an open question. We present Hearing to Translate, the first comprehensive test suite rigorously benchmarking 6 state-of-the-art SpeechLLMs against 16 strong direct and cascade systems that couple leading speech foundation models (SFMs) with multilingual LLMs. Our analysis spans 16 benchmarks, 13 language pairs, and 9 challenging conditions, including disfluent, noisy, and long-form speech. Across this extensive evaluation, we find that cascaded systems remain the most reliable solution overall, but the most recent SpeechLLMs can match or even outperform cascades in various settings, while SFMs lag behind both, highlighting that integrating an LLM, either within the model or in a pipeline, is essential for high-quality speech translation.
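The architectural contrast the abstract draws can be illustrated with a minimal sketch. This is not the paper's code: the type aliases, the `Cascade` and `Direct` classes, and the toy stub functions are all hypothetical names introduced here for illustration. It only shows where the intermediate transcript appears in a cascade and where it is absent in a direct system (an SFM or a SpeechLLM).

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical type aliases for this sketch (not from the paper).
Audio = bytes
Transcribe = Callable[[Audio], str]            # speech -> source-language transcript
TranslateText = Callable[[str, str], str]      # (transcript, target language) -> translation
TranslateSpeech = Callable[[Audio, str], str]  # (speech, target language) -> translation


@dataclass
class Cascade:
    """Two-stage pipeline: transcribe first, then translate the transcript."""
    transcribe: Transcribe
    translate_text: TranslateText

    def translate(self, audio: Audio, target_lang: str) -> str:
        # The translator only sees the transcript, never the audio itself.
        transcript = self.transcribe(audio)
        return self.translate_text(transcript, target_lang)


@dataclass
class Direct:
    """Single model mapping speech straight to translated text."""
    translate_speech: TranslateSpeech

    def translate(self, audio: Audio, target_lang: str) -> str:
        # No intermediate transcript is exposed between stages.
        return self.translate_speech(audio, target_lang)


# Toy stand-ins so the sketch runs; real systems would call actual models.
def toy_asr(audio: Audio) -> str:
    return audio.decode("utf-8")


def toy_text_translator(text: str, target_lang: str) -> str:
    return f"[{target_lang}] {text}"


def toy_speech_translator(audio: Audio, target_lang: str) -> str:
    return f"[{target_lang}] {audio.decode('utf-8')}"


cascade = Cascade(toy_asr, toy_text_translator)
direct = Direct(toy_speech_translator)
cascade_output = cascade.translate(b"hello", "de")
direct_output = direct.translate(b"hello", "de")
```

Both classes expose the same `translate` method, so a benchmark harness could evaluate either family through one interface. The toy stubs carry no information about real model quality.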

