Invisible Influences: Investigating Implicit Intersectional Biases through Persona Engineering in Large Language Models

arXiv cs.CL / 4/9/2026


Key Points

  • The paper argues that current LLM bias audits using static, embedding-based association tests can miss how bias changes when models adopt different social persona contexts.
  • It introduces BADx, a scalable metric for quantifying persona-induced bias amplification using differential bias measures (based on CEAT/I-WEAT/I-SEAT) plus a Persona Sensitivity Index and volatility, with local explainability via LIME.
  • The study runs two tasks: establishing static bias baselines and then applying six persona frames (marginalized vs. structurally advantaged) to measure context-dependent effects across models.
  • Experiments across GPT-4o, DeepSeek-R1, LLaMA-4, Claude 4.0 Sonnet, and Gemma-3n E4B show persona context significantly modulates bias, with notable differences in sensitivity, amplification, and stability/volatility by model.
  • The authors conclude that BADx outperforms static methods by surfacing dynamic implicit intersectional biases that static audits may overlook.

Abstract

Large Language Models (LLMs) excel at human-like language generation but often embed and amplify implicit, intersectional biases, especially in persona-driven contexts. Existing bias audits rely on static, embedding-based tests (CEAT, I-WEAT, I-SEAT) that quantify absolute association strengths; we show that these fail to capture the dynamic shifts that occur when models adopt social roles. To address this gap, we introduce the Bias Amplification Differential and Explainability Score (BADx): a novel, scalable metric that measures persona-induced bias amplification and integrates local explainability insights. BADx comprises three components - differential bias scores (BAD, based on CEAT, I-WEAT, and I-SEAT), a Persona Sensitivity Index (PSI), and volatility (standard deviation) - augmented by LIME-based analysis for explainability. The study is organized into two tasks: Task 1 establishes static bias baselines, and Task 2 applies six persona frames (marginalized and structurally advantaged) to measure BADx, PSI, and volatility. These tasks are evaluated across five state-of-the-art LLMs (GPT-4o, DeepSeek-R1, LLaMA-4, Claude 4.0 Sonnet, and Gemma-3n E4B). Results show that persona context significantly modulates bias: GPT-4o exhibits high sensitivity and volatility; DeepSeek-R1 suppresses bias but with erratic volatility; LLaMA-4 maintains low volatility and a stable bias profile with limited amplification; Claude 4.0 Sonnet achieves balanced modulation; and Gemma-3n E4B attains the lowest volatility with moderate amplification. By surfacing context-sensitive biases that static audits overlook, BADx outperforms static methods, offering a systematic way to detect dynamic implicit intersectional bias in five popular LLMs.
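To make the metric's structure concrete, here is a minimal sketch of how a WEAT-style differential audit could be assembled. The abstract does not give BADx's exact formulas, so the definitions below (BAD as persona score minus static baseline, PSI as the mean absolute differential over personas, volatility as the sample standard deviation of persona-conditioned scores) are plausible readings, not the paper's definitions; the toy embeddings and persona scores are likewise stand-ins for real model representations.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def association(w, A, B):
    # s(w, A, B): mean cosine similarity of w to attribute set A minus set B
    return np.mean([cosine(w, a) for a in A]) - np.mean([cosine(w, b) for b in B])

def weat_effect_size(X, Y, A, B):
    # WEAT-style effect size for target sets X, Y against attribute sets A, B
    sX = [association(x, A, B) for x in X]
    sY = [association(y, A, B) for y in Y]
    return (np.mean(sX) - np.mean(sY)) / np.std(sX + sY, ddof=1)

def bad(baseline_score, persona_score):
    # Bias Amplification Differential (assumed form): persona-conditioned
    # effect size minus the static baseline
    return persona_score - baseline_score

def psi_and_volatility(baseline_score, persona_scores):
    # PSI (assumed form): mean |BAD| across persona frames;
    # volatility: sample standard deviation of persona-conditioned scores
    diffs = [bad(baseline_score, p) for p in persona_scores]
    return float(np.mean(np.abs(diffs))), float(np.std(persona_scores, ddof=1))

# Toy demo: random vectors stand in for embeddings of target/attribute words
rng = np.random.default_rng(0)
emb = lambda n: [rng.normal(size=50) for _ in range(n)]
X, Y, A, B = emb(4), emb(4), emb(4), emb(4)
base = weat_effect_size(X, Y, A, B)                      # Task 1: static baseline
personas = [base + rng.normal(scale=0.3) for _ in range(6)]  # stand-in persona scores
psi, vol = psi_and_volatility(base, personas)            # Task 2: persona effects
```

In this framing, a high PSI with high volatility would match the GPT-4o pattern the abstract describes, while low volatility with limited amplification would match LLaMA-4.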