When Audio-LLMs Don't Listen: A Cross-Linguistic Study of Modality Arbitration

arXiv cs.CL / 3/25/2026


Key Points

  • The study finds that when speech-enabled language models face audio-vs-text conflicts, they follow the conflicting text far more often than they follow audio, even when explicitly instructed to trust the audio.
  • It introduces ALME, a controlled multilingual dataset with 57,602 audio-text conflict stimuli across eight languages, and proposes Text Dominance Ratio (TDR) to quantify how often models heed conflicting text under audio-trust instructions.
  • Experiments show Gemini 2.0 Flash and GPT-4o have TDR 10–26× higher than a baseline that swaps audio with its transcript, indicating “text dominance” is driven by more than informational content.
  • The paper argues the effect reflects an “arbitration accessibility” asymmetry at decision time, with TDR reduced when the transcript is deliberately corrupted and increased when models are forced into explicit transcription.
  • Fine-tuning ablations suggest arbitration behavior depends more on LLM reasoning than on the audio-input pathway alone, and the same qualitative pattern appears across multiple audio-LLMs with cross-model and cross-linguistic variation.

Abstract

When audio and text conflict, speech-enabled language models follow text far more often than they do when arbitrating between two conflicting text sources, even under explicit instructions to trust the audio. We introduce ALME (Audio-LLM Modality Evaluation), a dataset of 57,602 controlled audio-text conflict stimuli across eight languages, together with Text Dominance Ratio (TDR), which measures how often a model follows conflicting text when instructed to follow audio. Gemini 2.0 Flash and GPT-4o show TDR 10–26× higher than a baseline that replaces audio with its transcript under otherwise identical conditions (Gemini 2.0 Flash: 16.6% vs. 1.6%; GPT-4o: 23.2% vs. 0.9%). These results suggest that text dominance reflects not only information content, but also an asymmetry in arbitration accessibility, i.e., how easily the model can use competing representations at decision time. Framing the transcript as deliberately corrupted reduces TDR by 80%, whereas forcing explicit transcription increases it by 14%. A fine-tuning ablation further suggests that arbitration behavior depends more on LLM reasoning than on the audio input path alone. Across four audio-LLMs, we observe the same qualitative pattern with substantial cross-model and cross-linguistic variation.
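To make the metric concrete, here is a minimal sketch of how a Text Dominance Ratio could be computed from trial records. The paper defines TDR as how often the model follows the conflicting text when instructed to follow the audio; the exact normalization used below (text-matching answers divided by all genuine conflict trials) and the function name are assumptions for illustration, not the authors' implementation.

```python
def text_dominance_ratio(answers, text_labels, audio_labels):
    """Fraction of audio-vs-text conflict trials where the model's answer
    matches the (conflicting) text rather than the audio it was told to trust.

    answers, text_labels, audio_labels: parallel lists of answer strings.
    Trials where text and audio agree are not conflicts and are excluded.
    """
    followed_text = sum(
        a == t
        for a, t, au in zip(answers, text_labels, audio_labels)
        if t != au  # only count genuine audio-vs-text conflicts
    )
    total_conflicts = sum(t != au for t, au in zip(text_labels, audio_labels))
    return followed_text / total_conflicts if total_conflicts else 0.0


# Toy example: 4 trials, 3 of which are real conflicts; the model
# sides with the conflicting text in 1 of those 3.
answers      = ["cat", "dog", "bird", "fish"]
audio_labels = ["cat", "dog", "bird", "fox"]
text_labels  = ["car", "dot", "bird", "fish"]  # trial 3 has no conflict
print(text_dominance_ratio(answers, text_labels, audio_labels))  # → 0.333...
```

In this toy run the model answers "fish" on a trial where the audio said "fox", so TDR is 1/3. The paper's headline numbers (e.g., 16.6% for Gemini 2.0 Flash) are this kind of ratio measured over the full ALME conflict set.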