Can We Locate and Prevent Stereotypes in LLMs?

arXiv cs.CL / 4/23/2026


Key Points

  • The paper examines how harmful societal stereotypes are represented inside large language models, focusing on where such biases reside within the neural network.
  • It analyzes GPT-2 Small and Llama 3.2 using two methods: locating individual contrastive neuron activations associated with stereotypes, and identifying attention heads that strongly drive biased outputs (illustrative sketches of both approaches appear below).
  • The study aims to produce “bias fingerprints” that map stereotype-related internal mechanisms to better understand and eventually mitigate bias.
  • Its results are positioned as initial insights rather than a finalized mitigation system, highlighting gaps in current knowledge about internal bias localization.
  • The work supports efforts to prevent stereotype propagation by offering more interpretable targets for future bias-reduction techniques.
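
The first approach looks for individual neurons whose activations separate stereotype-consistent from counter-stereotypical inputs. The paper's exact procedure is not reproduced here; the following is only a minimal sketch of the general idea, using GPT-2 Small via Hugging Face transformers, hook-captured post-GELU MLP activations, and a pair of made-up contrastive prompts as placeholders rather than the paper's data.

```python
# Sketch: rank GPT-2 Small MLP neurons by how differently they activate on
# stereotype-consistent vs. counter-stereotypical prompts. Prompt pairs are
# illustrative placeholders, not the paper's dataset.
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# Hypothetical contrastive pairs (stereotype-consistent vs. counter-stereotypical).
pairs = [
    ("The nurse said that she", "The nurse said that he"),
    ("The engineer said that he", "The engineer said that she"),
]

captured = {}  # layer index -> MLP activations from the current forward pass

def make_hook(layer_idx):
    def hook(module, inputs, output):
        # output: (batch, seq_len, 3072) post-GELU MLP neuron activations
        captured[layer_idx] = output.detach()
    return hook

handles = [
    block.mlp.act.register_forward_hook(make_hook(i))
    for i, block in enumerate(model.transformer.h)
]

def last_token_activations(text):
    """Return a (num_layers, 3072) tensor of MLP activations at the final token."""
    ids = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        model(**ids)
    return torch.stack([captured[i][0, -1, :] for i in range(len(model.transformer.h))])

# Mean per-neuron activation difference across the contrastive pairs.
diffs = torch.stack([
    last_token_activations(stereo) - last_token_activations(anti)
    for stereo, anti in pairs
]).mean(dim=0)  # shape: (num_layers, 3072)

# Neurons whose activations separate the two conditions most strongly.
top = torch.topk(diffs.abs().flatten(), k=10).indices
for idx in top:
    layer, neuron = divmod(idx.item(), diffs.shape[1])
    print(f"layer {layer:2d}, neuron {neuron:4d}, mean diff {diffs[layer, neuron]:+.4f}")

for h in handles:
    h.remove()
```

Neurons that rank highly across many such pairs would be candidate entries in a "bias fingerprint"; applying the same idea to Llama 3.2 would require the analogous hooks on its MLP activations.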

Abstract

Stereotypes in large language models (LLMs) can perpetuate harmful societal biases. Despite the widespread use of these models, little is known about where such biases reside in the neural network. This study investigates the internal mechanisms of GPT-2 Small and Llama 3.2 to locate stereotype-related activations. We explore two approaches: identifying individual contrastive neuron activations that encode stereotypes, and detecting attention heads that contribute heavily to biased outputs. Our experiments aim to map these "bias fingerprints" and provide initial insights for mitigating stereotypes.
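
The second approach asks which attention heads contribute most heavily to a biased output. One common way to operationalize this, offered here only as an assumed stand-in for the paper's method, is head ablation: zero out one head at a time and measure how much a stereotype-related logit gap shrinks. The prompt, the token pair, and the GPT-2-specific hook point (the input to each block's attn.c_proj) below are all illustrative choices.

```python
# Sketch: zero-ablate each attention head in GPT-2 Small and measure the change
# in the logit gap between a stereotype-consistent and a counter-stereotypical
# next token. Heads whose removal shrinks the gap most are candidate components
# of a "bias fingerprint". Prompt and tokens are illustrative.
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The nurse said that"
stereo_id = tokenizer.encode(" she")[0]  # stereotype-consistent continuation
anti_id = tokenizer.encode(" he")[0]     # counter-stereotypical continuation
inputs = tokenizer(prompt, return_tensors="pt")

def logit_gap():
    """Logit difference between the stereotype-consistent and contrasting tokens."""
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]
    return (logits[stereo_id] - logits[anti_id]).item()

baseline = logit_gap()
n_heads = model.config.n_head
head_dim = model.config.n_embd // n_heads

results = []
for layer, block in enumerate(model.transformer.h):
    for head in range(n_heads):
        # Zero this head's slice of the merged attention output just before
        # the output projection (c_proj), which concatenates all heads.
        def ablate(module, args, head=head):
            hidden = args[0].clone()
            hidden[..., head * head_dim:(head + 1) * head_dim] = 0.0
            return (hidden,) + args[1:]
        handle = block.attn.c_proj.register_forward_pre_hook(ablate)
        results.append((baseline - logit_gap(), layer, head))
        handle.remove()

# Heads whose ablation reduces the bias gap the most.
for delta, layer, head in sorted(results, reverse=True)[:10]:
    print(f"layer {layer:2d}, head {head:2d}: gap reduced by {delta:+.4f}")
```

A single prompt is used here for brevity; in practice the gap would be averaged over a benchmark of contrastive sentences before attributing bias to particular heads.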