Auditing demographic bias in AI-based emergency police dispatch: a cross-lingual evaluation of eleven large language models

arXiv cs.CL / 5/5/2026

📰 NewsSignals & Early TrendsIdeas & Deep AnalysisModels & Research

Key Points

  • The paper presents a cross-lingual auditing framework for assessing demographic bias in LLM-assisted emergency police dispatch by turning the Police Priority Dispatch System into a five-level ordinal classification task.
  • Using 19,800 outputs from 11 frontier LLMs across multiple scenario pairs, demographic cue types (religious appearance, gender, race), and two languages (English and Mandarin Chinese), the study finds bias appears mainly when incident severity is ambiguous and largely diminishes when priority is clear from call content.
  • Bias strength varies by demographic axis, with the largest effects for religious appearance, then gender, and then race, indicating that fairness risks are not uniform across attributes.
  • The work shows cross-lingual asymmetries: gender bias is amplified in Mandarin Chinese while race bias is more pronounced in English, and some scenarios even show counter-directional effects that complicate simple stereotype-amplification explanations.
  • The authors argue that bias is an interaction effect among demographic signals, contextual ambiguity, and language, and they provide the framework as scalable pre-deployment infrastructure for agencies evaluating candidate models.

Abstract

Large language models (LLMs) are rapidly being integrated into high-stakes public safety systems, including emergency call triage and dispatch decision support, yet their demographic fairness in this context remains largely untested. Here we introduce a cross-lingual audit framework that operationalizes the Police Priority Dispatch System as a five-level ordinal classification task and applies a controlled minimal-pair design to isolate the effect of demographic cues. Across 19,800 model outputs spanning 11 frontier models, 15 scenario pairs, three demographic categories (religious appearance, gender, and race), and two languages (English and Mandarin Chinese), we find that demographic bias emerges systematically when incident severity is ambiguous but largely disappears when the operational priority is clearly determined by call content. Bias magnitude varies by demographic axis, with the largest effects observed for religious appearance, followed by gender and race. Critically, bias does not transfer consistently across languages: gender bias is substantially amplified in Mandarin Chinese, whereas race bias is more pronounced in English, revealing cross-lingual asymmetries that aggregate analyses obscure. In several scenarios, demographic cues produce counter-directional effects, challenging simple stereotype-amplification accounts of model behavior. These findings suggest that bias in LLM-based dispatch is not a fixed property of models alone, but arises from the interaction between demographic signals, contextual ambiguity, and language. Beyond these empirical results, the proposed framework provides a scalable audit infrastructure that enables deploying agencies to evaluate candidate models on jurisdiction-relevant scenarios prior to real-world adoption.