Large Language Models Reproduce Racial Stereotypes When Used for Text Annotation

arXiv cs.CL / March 17, 2026

Key Points

  • A large-scale evaluation of 19 LLMs across two experiments with more than 4 million annotation judgments shows that automated text annotation systematically reflects racial stereotypes, even when annotating neutral text.
  • In the names-based experiment spanning 39 tasks, texts with Black-associated names were rated as more aggressive and more gossipy by the majority of models, while Asian names were perceived as more intelligent but less confident and less sociable.
  • In the dialect experiment, the same sentence written in African American Vernacular English was judged significantly less professional, less indicative of an educated speaker, more toxic, and more angry by nearly all models.
  • The findings imply that using LLMs as automated annotators can embed socially patterned biases into the datasets and measurements underpinning research, governance, and decision-making. One notable exception is name-based hireability, where fine-tuning appears to overcorrect, systematically favoring minority-named applicants.

Abstract

Large language models (LLMs) are increasingly used for automated text annotation in tasks ranging from academic research to content moderation and hiring. Across 19 LLMs and two experiments totaling more than 4 million annotation judgments, we show that subtle identity cues embedded in text systematically bias annotation outcomes in ways that mirror racial stereotypes. In a names-based experiment spanning 39 annotation tasks, texts containing names associated with Black individuals are rated as more aggressive by 18 of 19 models and more gossipy by 18 of 19. Asian names produce a bamboo-ceiling profile: 17 of 19 models rate individuals as more intelligent, while 18 of 19 rate them as less confident and less sociable. Arab names elicit cognitive elevation alongside interpersonal devaluation, and all four minority groups are consistently rated as less self-disciplined. In a matched dialect experiment, the same sentence is judged significantly less professional (all 19 models, mean gap -0.774), less indicative of an educated speaker (-0.688), more toxic (18/19), and more angry (19/19) when written in African American Vernacular English rather than Standard American English. A notable exception occurs for name-based hireability, where fine-tuning appears to overcorrect, systematically favoring minority-named applicants. These findings suggest that using LLMs as automated annotators can embed socially patterned biases directly into the datasets and measurements that increasingly underpin research, governance, and decision-making.
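The names-based experiment rests on a matched-pair (counterfactual substitution) design: the same text is annotated twice, differing only in the embedded name, so any systematic rating gap is attributable to the name cue. A minimal sketch of that design follows; the `annotate` stub, the template sentences, and the name lists are all illustrative placeholders, not the paper's actual stimuli or prompts.

```python
from statistics import mean

# Hypothetical matched-pair bias probe. Each template is annotated with
# names from two groups; a nonzero mean gap in the returned ratings
# would indicate name-based bias. All names/templates are illustrative.

TEMPLATES = [
    "{name} raised their voice during the meeting.",
    "{name} shared an update about a coworker's weekend plans.",
]

GROUP_A = ["Emily", "Greg"]      # illustrative majority-associated names
GROUP_B = ["Lakisha", "Jamal"]   # illustrative minority-associated names


def annotate(text: str, trait: str) -> float:
    """Stand-in for an LLM annotation call returning a 1-5 rating.
    A real probe would prompt the model, e.g. asking it to rate how
    {trait} the text sounds on a 1-5 scale. Here we return a constant
    so the pipeline runs end to end without an API call."""
    return 3.0


def mean_gap(trait: str) -> float:
    """Mean rating difference (group B minus group A) over all
    template/name pairs for a given trait."""
    a = [annotate(t.format(name=n), trait) for t in TEMPLATES for n in GROUP_A]
    b = [annotate(t.format(name=n), trait) for t in TEMPLATES for n in GROUP_B]
    return mean(b) - mean(a)


if __name__ == "__main__":
    print(f"aggressiveness gap: {mean_gap('aggressive'):+.3f}")
```

With a real model behind `annotate`, the paper's reported effects would show up as positive gaps on traits like "aggressive" for Black-associated names, and the same harness extends to the dialect experiment by swapping dialect variants of a sentence instead of names.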