Do LLM Decoders Listen Fairly? Benchmarking How Language Model Priors Shape Bias in Speech Recognition

arXiv cs.CL / 4/24/2026


Key Points

  • The study benchmarks nine speech recognition models spanning three decoder generations (CTC with no language model, encoder-decoder with an implicit LM, and explicit pretrained LLM decoders) to test whether text-derived priors reduce or worsen demographic bias across ethnicity, accent, gender, age, and first language.
  • On ~43,000 utterances from Common Voice 24 and Meta’s Fair-Speech (a controlled-prompt dataset that removes vocabulary confounds), the authors find that LLM decoders generally do not amplify racial bias, with Granite-8B showing the best ethnicity fairness, while Whisper exhibits severe, non-monotonic hallucination on Indian-accented speech.
  • Under 12 types of acoustic degradation, the researchers observe that extreme degradation can compress fairness gaps as all groups converge to high WER, but silence injection can substantially amplify Whisper’s accent bias due to demographic-selective hallucination.
  • The paper also reports that Whisper is prone to catastrophic repetition loops under masking, whereas explicit-LLM decoders produce far fewer insertions and near-zero repetition; high-compression audio encoding can reintroduce repetition problems even for LLM decoders.
  • Overall, the results indicate that the design of the audio encoder (and related robustness to audio artifacts) is a more important driver of equitable, robust speech recognition than simply scaling LLM decoders.
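The fairness numbers above (e.g. Granite-8B's max/min WER = 2.28 across ethnicity groups) boil down to computing word error rate per demographic group and taking the ratio of the worst to the best group. A minimal sketch of that metric, with a hand-rolled Levenshtein-based WER (the paper does not specify its tooling, so the function names here are illustrative):

```python
from collections import defaultdict

def word_errors(ref, hyp):
    """Levenshtein distance between word sequences
    (substitutions + deletions + insertions), plus reference length."""
    r, h = ref.split(), hyp.split()
    d = list(range(len(h) + 1))          # single-row DP table
    for i in range(1, len(r) + 1):
        prev, d[0] = d[0], i             # prev holds d[i-1][j-1]
        for j in range(1, len(h) + 1):
            cur = d[j]
            d[j] = min(d[j] + 1,                     # deletion
                       d[j - 1] + 1,                 # insertion
                       prev + (r[i - 1] != h[j - 1]))  # substitution/match
            prev = cur
    return d[-1], len(r)

def fairness_ratio(samples):
    """samples: iterable of (group, reference, hypothesis) triples.
    Returns per-group WER and the max/min WER ratio (1.0 = perfectly fair)."""
    err, tot = defaultdict(int), defaultdict(int)
    for group, ref, hyp in samples:
        e, n = word_errors(ref, hyp)
        err[group] += e
        tot[group] += n
    wer = {g: err[g] / tot[g] for g in tot}
    return wer, max(wer.values()) / min(wer.values())
```

Note that WER is aggregated over all of a group's utterances before dividing (pooled errors over pooled reference words), rather than averaging per-utterance WERs; a group whose WER is zero would make the ratio undefined, which is why the metric is typically reported on noisy real-world groups.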

Abstract

As pretrained large language models replace task-specific decoders in speech recognition, a critical question arises: do their text-derived priors make recognition fairer or more biased across demographic groups? We evaluate nine models spanning three architectural generations (CTC with no language model, encoder-decoder with an implicit LM, and LLM-based with an explicit pretrained decoder) on about 43,000 utterances across five demographic axes (ethnicity, accent, gender, age, first language) using Common Voice 24 and Meta's Fair-Speech, a controlled-prompt dataset that eliminates vocabulary confounds. On clean audio, three findings challenge assumptions: LLM decoders do not amplify racial bias (Granite-8B has the best ethnicity fairness, max/min WER = 2.28); Whisper exhibits pathological hallucination on Indian-accented speech with a non-monotonic insertion-rate spike to 9.62% at large-v3; and audio compression predicts accent fairness more than LLM scale. We then stress-test these findings under 12 acoustic degradation conditions (noise, reverberation, silence injection, chunk masking) across both datasets, totaling 216 inference runs. Severe degradation paradoxically compresses fairness gaps as all groups converge to high WER, but silence injection amplifies Whisper's accent bias up to 4.64x by triggering demographic-selective hallucination. Under masking, Whisper enters catastrophic repetition loops (86% of 51,797 insertions) while explicit-LLM decoders produce 38x fewer insertions with near-zero repetition; high-compression audio encoding (Q-former) reintroduces repetition pathology even in LLM decoders. These results suggest that audio encoder design, not LLM scaling, is the primary lever for equitable and robust speech recognition.
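The "catastrophic repetition loops" the abstract attributes to Whisper under masking (86% of 51,797 insertions) are transcripts where the decoder emits the same short phrase over and over. A simple heuristic for flagging such outputs is to check whether any short n-gram repeats back-to-back more than a few times; the function below is an illustrative sketch, not the paper's detection method, and the default thresholds are assumptions:

```python
def repetition_loop(text, max_n=4, min_repeats=3):
    """Heuristic loop detector: True if any n-gram (n <= max_n) occurs
    back-to-back at least min_repeats times anywhere in the transcript."""
    words = text.split()
    for n in range(1, max_n + 1):
        for start in range(n):                       # try every alignment offset
            run = 1
            for i in range(start + n, len(words) - n + 1, n):
                if words[i:i + n] == words[i - n:i]:  # same n-gram as previous
                    run += 1
                    if run >= min_repeats:
                        return True
                else:
                    run = 1
    return False
```

For example, `repetition_loop("the cat the cat the cat sat")` fires on the repeated bigram, while a normal sentence does not. Counting the words inside such runs as insertions is one way to arrive at aggregate figures like the insertion-rate spikes reported above.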