From Seeing it to Experiencing it: Interactive Evaluation of Intersectional Voice Bias in Human-AI Speech Interaction

arXiv cs.CL / 4/16/2026


Key Points

  • The paper examines how accent and perceived gender can produce intersectional bias in end-to-end SpeechLLM interactions, going beyond existing evaluations that focus on isolated outputs.
  • It differentiates quality-of-service disparities (e.g., off-topic or low-effort responses) from content-level bias in coherent responses, including alignment and verbosity effects.
  • The authors propose a two-part evaluation: a controlled, judge-free prompt-response analysis across six accents and two gender presentations, plus an interactive user study.
  • Using voice conversion, participants can experience identical content through different vocal identities, enabling direct measurement of perceived trust, acceptability, and perspective-taking.
  • Results across two studies (Interactive, N=24; Observational, N=19) show that voice conversion increases trust and acceptability for benign responses, and that automated analysis reveals accent × gender disparities in alignment and verbosity across SpeechLLMs.
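
A judge-free prompt-response analysis of the kind described above can be made concrete with a small sketch. The following is purely illustrative and not the paper's code: it assumes hypothetical response records tagged with speaker accent and gender presentation, and computes mean reply verbosity per accent × gender group plus a crude max-minus-min disparity score as one possible quality-of-service signal.

```python
from statistics import mean

# Hypothetical toy data (not from the paper): each record pairs the
# vocal identity of the prompt speaker with the model's reply text.
responses = [
    {"accent": "US", "gender": "F", "reply": "Sure, here is a detailed answer with several steps."},
    {"accent": "US", "gender": "M", "reply": "Sure, here is an answer."},
    {"accent": "IN", "gender": "F", "reply": "Yes."},
    {"accent": "IN", "gender": "M", "reply": "Yes, that works."},
]

def verbosity_by_group(records):
    """Mean reply length (in words) per (accent, gender) group."""
    groups = {}
    for r in records:
        key = (r["accent"], r["gender"])
        groups.setdefault(key, []).append(len(r["reply"].split()))
    return {g: mean(lengths) for g, lengths in groups.items()}

def verbosity_gap(records):
    """Max minus min group mean: a crude disparity score, judge-free."""
    means = verbosity_by_group(records)
    return max(means.values()) - min(means.values())
```

The same grouping pattern extends to other judge-free metrics (e.g., off-topic or refusal rates), swapping the word count for a different per-response statistic.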

Abstract

SpeechLLMs process spoken language directly from audio, but accent and vocal identity cues can lead to biased behaviour. Current bias evaluations often miss how such bias manifests in end-to-end speech interactions and how users experience it. We distinguish quality-of-service disparities (e.g., off-topic or low-effort responses) from content-level bias in coherent outputs, and examine intersectional effects of accent and perceived gender. In this work, we explore a two-part evaluation approach: (1) a controlled test cohort spanning six accents and two gender presentations, analysed with judge-free prompt-response metrics, and (2) an interactive study design using voice conversion to let users experience identical content through different vocal identities. Across two studies (Interactive, N=24; Observational, N=19), we find that voice conversion increases trust and acceptability for benign responses and encourages perspective-taking, while automated analysis targeting quality-of-service disparities reveals accent × gender disparities in alignment and verbosity across SpeechLLMs. These results highlight voice conversion as a tool for probing and experiencing intersectional voice bias, while our evaluation suite provides richer bias evaluations for spoken conversational AI.