Do LLM Decoders Listen Fairly? Benchmarking How Language Model Priors Shape Bias in Speech Recognition
arXiv cs.CL / 4/24/2026
Key Points
- The study benchmarks nine large language model (LLM) decoder approaches for speech recognition to test whether text-derived priors reduce or worsen demographic bias across ethnicity, accent, gender, age, and first language.
- Using controlled prompting on ~43,000 utterances from Common Voice 24 and Meta’s Fair-Speech, the authors find that LLM decoders generally do not amplify racial bias: Granite-8B shows the best ethnicity fairness, while Whisper can exhibit severe, non-monotonic hallucination on Indian-accented speech (a per-group WER comparison of this kind is sketched in the first example after this list).
- Under 12 types of acoustic degradation, extreme degradation can compress fairness gaps as all groups converge to uniformly high WER, but silence injection can substantially amplify Whisper’s accent bias through demographic-selective hallucination (a minimal silence-injection transform appears in the second sketch after this list).
- The paper also reports that Whisper is prone to catastrophic repetition loops under masking, whereas explicit-LLM decoders produce far fewer insertions and near-zero repetition; high-compression audio encoding can reintroduce repetition even for LLM decoders (a simple loop-detection heuristic is sketched after this list).
- Overall, the results indicate that the design of the audio encoder (and related robustness to audio artifacts) is a more important driver of equitable, robust speech recognition than simply scaling LLM decoders.
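To make the fairness measurement concrete, here is a minimal sketch of computing per-group word error rate (WER) and a worst-minus-best fairness gap. The sample data and group labels are hypothetical illustrations, and jiwer is a standard WER library, not something the paper prescribes.

```python
# Minimal sketch: corpus-level WER per demographic group and the
# worst-minus-best fairness gap. Group names and samples are made up.
from collections import defaultdict

import jiwer  # pip install jiwer

def per_group_wer(samples):
    """samples: iterable of (group, reference_text, hypothesis_text)."""
    refs, hyps = defaultdict(list), defaultdict(list)
    for group, ref, hyp in samples:
        refs[group].append(ref)
        hyps[group].append(hyp)
    # jiwer.wer on lists gives corpus-level WER: total edits / total ref words.
    return {g: jiwer.wer(refs[g], hyps[g]) for g in refs}

def fairness_gap(group_wers):
    """Worst minus best WER across groups; 0 means equal error rates."""
    return max(group_wers.values()) - min(group_wers.values())

samples = [
    ("accent_A", "turn the lights off", "turn the lights off"),
    ("accent_B", "turn the lights off", "turn the light of"),
]
wers = per_group_wer(samples)
print(wers, fairness_gap(wers))
```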
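The silence-injection degradation mentioned above can be as simple as splicing zeros into the waveform. This sketch uses NumPy; the insertion point and duration are illustrative choices, not the paper's exact protocol.

```python
# Rough sketch of one stress test: inject a stretch of silence into
# a mono waveform. Parameters here are illustrative assumptions.
import numpy as np

def inject_silence(wav: np.ndarray, sr: int, at_s: float, dur_s: float) -> np.ndarray:
    """Insert dur_s seconds of zeros into wav at time at_s (mono float array)."""
    idx = int(at_s * sr)
    gap = np.zeros(int(dur_s * sr), dtype=wav.dtype)
    return np.concatenate([wav[:idx], gap, wav[idx:]])

sr = 16_000
speech = np.random.randn(sr * 3).astype(np.float32)  # stand-in for 3 s of audio
degraded = inject_silence(speech, sr, at_s=1.0, dur_s=2.0)
```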
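Finally, repetition loops of the kind attributed to Whisper can be flagged with a crude n-gram heuristic: scan the hypothesis for any n-gram that repeats back-to-back. The thresholds below are illustrative assumptions, not the paper's metric.

```python
# Heuristic sketch: flag transcripts where some n-gram (n up to max_n)
# repeats back-to-back at least min_repeats times.
def has_repetition_loop(text: str, max_n: int = 5, min_repeats: int = 3) -> bool:
    words = text.lower().split()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n):
            gram = words[i : i + n]
            repeats, j = 1, i + n
            while words[j : j + n] == gram:  # count consecutive copies
                repeats += 1
                j += n
            if repeats >= min_repeats:
                return True
    return False

print(has_repetition_loop("thank you thank you thank you thank you"))  # True
print(has_repetition_loop("the quick brown fox jumps"))                # False
```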