When Audio-LLMs Don't Listen: A Cross-Linguistic Study of Modality Arbitration

arXiv cs.CL / 3/25/2026


Key Points

  • The study finds that when speech-enabled language models face audio-vs-text conflicts, they follow the conflicting text far more often than they follow audio, even when explicitly instructed to trust the audio.
  • It introduces ALME, a controlled multilingual dataset with 57,602 audio-text conflict stimuli across eight languages, and proposes Text Dominance Ratio (TDR) to quantify how often models heed conflicting text under audio-trust instructions.
  • Experiments show Gemini 2.0 Flash and GPT-4o have TDR 10–26× higher than a baseline that swaps audio with its transcript, indicating “text dominance” is driven by more than informational content.
  • The paper argues the effect reflects an “arbitration accessibility” asymmetry at decision time, with TDR reduced when the transcript is deliberately corrupted and increased when models are forced into explicit transcription.
  • Fine-tuning ablations suggest arbitration behavior depends more on LLM reasoning than on the audio-input pathway alone, and the same qualitative pattern appears across multiple audio-LLMs with cross-model and cross-linguistic variation.

Abstract

When audio and text conflict, speech-enabled language models follow text far more often than they do when arbitrating between two conflicting text sources, even under explicit instructions to trust the audio. We introduce ALME (Audio-LLM Modality Evaluation), a dataset of 57,602 controlled audio-text conflict stimuli across eight languages, together with Text Dominance Ratio (TDR), which measures how often a model follows conflicting text when instructed to follow audio. Gemini 2.0 Flash and GPT-4o show TDR 10–26× higher than a baseline that replaces audio with its transcript under otherwise identical conditions (Gemini 2.0 Flash: 16.6% vs. 1.6%; GPT-4o: 23.2% vs. 0.9%). These results suggest that text dominance reflects not only information content, but also an asymmetry in arbitration accessibility, i.e., how easily the model can use competing representations at decision time. Framing the transcript as deliberately corrupted reduces TDR by 80%, whereas forcing explicit transcription increases it by 14%. A fine-tuning ablation further suggests that arbitration behavior depends more on LLM reasoning than on the audio input path alone. Across four audio-LLMs, we observe the same qualitative pattern with substantial cross-model and cross-linguistic variation.
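To make the metric concrete, here is a minimal sketch of how a Text Dominance Ratio could be computed from trial records. The paper defines TDR as how often the model follows the conflicting text when instructed to follow the audio; the exact normalization used below (text-matching answers divided by all genuine conflict trials) and the function name are assumptions for illustration, not the authors' implementation.

```python
def text_dominance_ratio(answers, text_labels, audio_labels):
    """Fraction of audio-vs-text conflict trials where the model's answer
    matches the (conflicting) text rather than the audio it was told to trust.

    answers, text_labels, audio_labels: parallel lists of answer strings.
    Trials where text and audio agree are not conflicts and are excluded.
    """
    followed_text = sum(
        a == t
        for a, t, au in zip(answers, text_labels, audio_labels)
        if t != au  # only count genuine audio-vs-text conflicts
    )
    total_conflicts = sum(t != au for t, au in zip(text_labels, audio_labels))
    return followed_text / total_conflicts if total_conflicts else 0.0


# Toy example: 4 trials, 3 of which are real conflicts; the model
# sides with the conflicting text in 1 of those 3.
answers      = ["cat", "dog", "bird", "fish"]
audio_labels = ["cat", "dog", "bird", "fox"]
text_labels  = ["car", "dot", "bird", "fish"]  # trial 3 has no conflict
print(text_dominance_ratio(answers, text_labels, audio_labels))  # → 0.333...
```

In this toy run the model answers "fish" on a trial where the audio said "fox", so TDR is 1/3. The paper's headline numbers (e.g., 16.6% for Gemini 2.0 Flash) are this kind of ratio measured over the full ALME conflict set.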