Intersectional Fairness in Large Language Models

arXiv cs.CL / 4/23/2026


Key Points

  • The paper evaluates intersectional fairness in six large language models by testing them on ambiguous and disambiguated prompts from two benchmark datasets.
  • In ambiguous contexts, the models generally perform well by (correctly) answering “unknown,” which leaves few non-unknown predictions and so gives fairness metrics little signal about bias (see the sketch after this list).
  • In disambiguated contexts, accuracy is affected by stereotype alignment: models are more accurate when the correct answer supports a stereotype and less accurate when it contradicts one.
  • The stereotype-directional bias is especially strong for race–gender intersections, and subgroup fairness metrics show uneven outcome distributions even when some measured disparities appear small.
  • Across repeated runs, responses vary in consistency, with stereotype-aligned answers surfacing intermittently, leading the authors to conclude that none of the evaluated LLMs is reliably fair across intersectional settings.
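
The post does not reproduce the paper’s formulas, so the sketch below is only illustrative: it assumes a BBQ-style scoring convention, and every name in it (`Response`, `ambiguous_bias_score`, `alignment_accuracy_gap`, the field names) is hypothetical rather than taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class Response:
    """One model answer to a BBQ-style prompt (hypothetical schema)."""
    context: str     # "ambiguous" or "disambiguated"
    prediction: str  # "stereotyped", "anti_stereotyped", or "unknown"
    correct: bool    # prediction matches the gold label
    aligned: bool    # gold label reinforces the stereotype

def ambiguous_bias_score(responses):
    """Directional bias over non-unknown answers in ambiguous contexts.

    Returns +1 if every substantive answer targets the stereotyped group,
    -1 if every one targets the other group, and None when the model
    answers "unknown" so often that no substantive answers remain --
    the sparsity problem the paper points out.
    """
    substantive = [r for r in responses
                   if r.context == "ambiguous" and r.prediction != "unknown"]
    if not substantive:
        return None  # no signal: the metric is uninformative here
    n_stereo = sum(r.prediction == "stereotyped" for r in substantive)
    return 2 * n_stereo / len(substantive) - 1

def alignment_accuracy_gap(responses):
    """Accuracy on stereotype-aligned minus counter-stereotypical items."""
    disamb = [r for r in responses if r.context == "disambiguated"]

    def acc(items):
        return sum(r.correct for r in items) / len(items) if items else float("nan")

    aligned = acc([r for r in disamb if r.aligned])
    counter = acc([r for r in disamb if not r.aligned])
    return aligned - counter  # positive gap = stereotype-consistent competence
```

A `None` bias score is the sparsity problem in practice: a model that correctly answers “unknown” on nearly every ambiguous item leaves too few substantive predictions to measure a direction, while a positive accuracy gap corresponds to the stereotype-consistent competence pattern the paper reports.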

Abstract

Large Language Models (LLMs) are increasingly deployed in socially sensitive settings, raising concerns about fairness and biases, particularly across intersectional demographic attributes. In this paper, we systematically evaluate intersectional fairness in six LLMs using ambiguous and disambiguated contexts from two benchmark datasets. We assess LLM behavior using bias scores, subgroup fairness metrics, accuracy, and consistency through multi-run analysis across contexts and negative and non-negative question polarities. Our results show that while modern LLMs generally perform well in ambiguous contexts, this limits the informativeness of fairness metrics due to sparse non-unknown predictions. In disambiguated contexts, LLM accuracy is influenced by stereotype alignment, with models being more accurate when the correct answer reinforces a stereotype than when it contradicts it. This pattern is especially pronounced in race-gender intersections, where directional bias toward stereotypes is stronger. Subgroup fairness metrics further indicate that, despite low observed disparity in some cases, outcome distributions remain uneven across intersectional groups. Across repeated runs, responses also vary in consistency, including stereotype-aligned responses. Overall, our findings show that apparent model competence is partly associated with stereotype-consistent cues, and no evaluated LLM achieves consistently reliable or fair behavior across intersectional settings. These findings highlight the need for evaluation beyond accuracy, emphasizing the importance of combining bias, subgroup fairness, and consistency metrics across intersectional groups, contexts, and repeated runs.
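
The abstract invokes subgroup fairness and multi-run consistency without defining them. As a minimal sketch, assuming a demographic-parity-style disparity across intersectional subgroups and a modal-agreement notion of consistency (both assumptions on my part; the paper may use different formulations):

```python
from collections import Counter

def subgroup_disparity(outcomes):
    """Max absolute gap in positive-outcome rate across intersectional
    subgroups, e.g. {("Black", "woman"): [1, 0, 1], ...}.

    A demographic-parity-style measure; low disparity can still coexist
    with uneven outcome distributions, as the paper observes.
    """
    rates = {g: sum(v) / len(v) for g, v in outcomes.items() if v}
    return max(rates.values()) - min(rates.values())

def consistency(runs):
    """Fraction of repeated runs agreeing with the modal answer per prompt.

    `runs` maps a prompt id to the list of answers it received across runs.
    """
    per_prompt = []
    for answers in runs.values():
        modal_count = Counter(answers).most_common(1)[0][1]
        per_prompt.append(modal_count / len(answers))
    return sum(per_prompt) / len(per_prompt)

# Hypothetical usage: even a small disparity can mask instability that
# only shows up when the same prompt is asked several times.
outcomes = {("Black", "woman"): [1, 0, 1, 1], ("white", "man"): [1, 1, 1, 1]}
runs = {"q1": ["unknown", "stereotyped", "unknown"], "q2": ["unknown"] * 3}
print(subgroup_disparity(outcomes), round(consistency(runs), 2))  # 0.25 0.83
```

Pairing the two measures mirrors the abstract’s conclusion: a single-run fairness number can look acceptable while repeated runs still surface stereotype-aligned answers.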