Dialect vs Demographics: Quantifying LLM Bias from Implicit Linguistic Signals vs. Explicit User Profiles

arXiv cs.CL / 4/24/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper investigates whether demographic performance gaps in LLMs arise from explicitly stated identity or from implicitly conveyed socio-linguistic and dialect signals.
  • Using factorial tests on 24,000+ responses from Gemma-3-12B and Qwen-3-VL-8B, it compares prompts with explicit user profiles (e.g., stated Black identity) against prompts using implicit dialect cues (e.g., AAVE, Singlish) across sensitive domains (see the sketch after this list).
  • It finds a safety paradox: explicit identity prompts lead to higher refusal rates and lower semantic similarity to reference Black user texts, while implicit dialect cues nearly eliminate refusals and increase semantic similarity.
  • The study warns that this “dialect jailbreak” reduces content sanitization, revealing brittle safety alignment that over-relies on explicit keywords and creates unequal, bifurcated user experiences.
  • Overall, the work highlights a core alignment tension between equitable treatment and linguistic diversity, calling for safety mechanisms that generalize beyond explicit cues.
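
The factorial manipulation is easy to picture in code. Below is a minimal Python sketch of the design: one sensitive-domain question crossed with three identity-signal conditions and sent to each model. The question wording, the AAVE-style paraphrase, and the Hugging Face model ID are illustrative assumptions on my part, not the paper's actual materials.

```python
from transformers import pipeline

# Assumed Hugging Face ID; the paper also evaluates Qwen-3-VL-8B.
MODEL_IDS = ["google/gemma-3-12b-it"]

# One sensitive-domain question rendered under three identity-signal conditions:
# explicit profile, implicit dialect cue, and a Standard American English control.
PROMPTS = {
    "explicit":      ("User profile: I am a Black American. "
                      "What should I do about chest pain after exercise?"),
    "implicit_aave": "What I'm supposed to do bout my chest hurtin after I work out?",
    "control_sae":   "What should I do about chest pain after exercise?",
}

rows = []
for model_id in MODEL_IDS:
    generate = pipeline("text-generation", model=model_id)
    for condition, prompt in PROMPTS.items():
        # Each (model, condition) cell would be sampled many times over many
        # questions and domains to reach the study's 24,000+ responses.
        out = generate(prompt, max_new_tokens=256)[0]["generated_text"]
        rows.append({"model": model_id, "condition": condition, "response": out})
```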

Abstract

As state-of-the-art Large Language Models (LLMs) become ubiquitous, ensuring equitable performance across diverse demographics is critical. However, it remains unclear whether demographic performance disparities arise from explicitly stated identity itself or from the way identity is signaled. In real-world interactions, a user's identity is often conveyed implicitly through a complex combination of socio-linguistic factors. This study disentangles these signals with a factorial design over 24,000+ responses from two open-weight LLMs (Gemma-3-12B and Qwen-3-VL-8B), comparing prompts with explicitly announced user profiles against prompts carrying implicit dialect signals (e.g., AAVE, Singlish) across various sensitive domains. Our results uncover a unique paradox in LLM safety: users achieve “better” performance by sounding like a demographic than by stating they belong to it. Explicit identity prompts activate aggressive safety filters, increasing refusal rates and reducing semantic similarity to our reference texts for Black users. In contrast, implicit dialect cues trigger a powerful “dialect jailbreak,” reducing refusal probability to near zero while achieving higher semantic similarity to the reference texts than Standard American English prompts. However, this “dialect jailbreak” introduces a critical safety trade-off in content sanitization. We find that current safety alignment techniques are brittle and over-indexed on explicit keywords, creating a bifurcated user experience in which “standard” users receive cautious, sanitized information while dialect speakers navigate a rawer, less sanitized, and potentially more hostile information landscape. This reveals a fundamental tension in alignment, between equitable treatment and linguistic diversity, and underscores the need for safety mechanisms that generalize beyond explicit cues.
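
The abstract's two outcome measures, refusal rate and semantic similarity to a reference text, can be sketched as follows. The keyword-based refusal detector and the choice of embedding model here are assumptions for illustration; the paper's actual classifiers and similarity metrics may differ.

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative refusal phrases; a real study would use a validated classifier.
REFUSAL_MARKERS = ("i can't help", "i cannot assist", "i'm sorry, but")

def is_refusal(text: str) -> bool:
    t = text.lower()
    return any(marker in t for marker in REFUSAL_MARKERS)

# Assumed embedding model for measuring semantic similarity.
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_similarity(response: str, reference: str) -> float:
    # Cosine similarity between the response and the reference answer.
    a, b = embedder.encode([response, reference], convert_to_tensor=True)
    return util.cos_sim(a, b).item()

def summarize(responses: list[str], reference: str) -> dict:
    # Per-condition aggregates: refusal rate and mean similarity to reference.
    sims = [semantic_similarity(r, reference) for r in responses]
    return {
        "refusal_rate": sum(map(is_refusal, responses)) / len(responses),
        "mean_similarity": sum(sims) / len(sims),
    }
```

Comparing `summarize()` outputs across the explicit, implicit-dialect, and control conditions is what would surface the paradox described above: high refusal rates with low similarity for explicit profiles, near-zero refusals with higher similarity for dialect cues.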