When Names Change Verdicts: Intervention Consistency Reveals Systematic Bias in LLM Decision-Making

arXiv cs.CL / 3/20/2026

Key Points

  • The paper introduces ICE-Guard, a framework that applies intervention consistency testing to detect three types of spurious feature reliance in LLMs, evaluated across 3,000 vignettes spanning 10 high-stakes domains and 11 LLMs from 8 families.
  • The study identifies three bias types—demographic (name/race swaps), authority (credential/prestige swaps), and framing (positive/negative restatements)—and finds authority bias (mean 5.8%) and framing bias (5.0%) substantially exceed demographic bias (2.2%).
  • Bias concentration varies by domain, with finance showing 22.6% authority bias and criminal justice showing only 2.8%.
  • A structured decomposition approach, where the LLM extracts features and a deterministic rubric makes the final decision, reduces flip rates by up to 100% (a median reduction of 49% across 9 models).
  • An ICE-guided detect-diagnose-mitigate-verify loop achieves a cumulative 78% bias reduction via iterative prompt patching, and validation against real COMPAS recidivism data suggests the benchmark provides a conservative estimate of real-world bias; code and data are publicly available.
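The core measurement behind these findings is a flip rate: swap only a spurious feature (a name, a credential, the framing) between two otherwise identical vignettes and count how often the verdict changes. A minimal sketch of that test, with a toy decision function standing in for an LLM call (all names here are hypothetical, not from the paper's code):

```python
from typing import Callable, List, Tuple

def flip_rate(decide: Callable[[str], str],
              pairs: List[Tuple[str, str]]) -> float:
    """Fraction of vignette pairs whose verdict changes when only a
    spurious feature is swapped between the two versions."""
    flips = sum(decide(a) != decide(b) for a, b in pairs)
    return flips / len(pairs)

# Toy decider that leans on an authority cue -- a stand-in for an LLM.
def biased_decide(vignette: str) -> str:
    return "approve" if "Dr." in vignette else "deny"

# Each pair differs only in the title (an authority swap).
pairs = [
    ("Dr. Smith requests a $10k loan.", "Mr. Smith requests a $10k loan."),
    ("Dr. Lee requests a $5k loan.", "Mr. Lee requests a $5k loan."),
]
print(flip_rate(biased_decide, pairs))  # 1.0: every swap flips the verdict
```

A bias-free decider would score 0.0 on the same pairs, which is what the intervention consistency criterion demands.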

Abstract

Large language models (LLMs) are increasingly used for high-stakes decisions, yet their susceptibility to spurious features remains poorly characterized. We introduce ICE-Guard, a framework applying intervention consistency testing to detect three types of spurious feature reliance: demographic (name/race swaps), authority (credential/prestige swaps), and framing (positive/negative restatements). Across 3,000 vignettes spanning 10 high-stakes domains, we evaluate 11 LLMs from 8 families and find that (1) authority bias (mean 5.8%) and framing bias (5.0%) substantially exceed demographic bias (2.2%), challenging the field's narrow focus on demographics; (2) bias concentrates in specific domains -- finance shows 22.6% authority bias while criminal justice shows only 2.8%; (3) structured decomposition, where the LLM extracts features and a deterministic rubric decides, reduces flip rates by up to 100% (median 49% across 9 models). We demonstrate an ICE-guided detect-diagnose-mitigate-verify loop achieving cumulative 78% bias reduction via iterative prompt patching. Validation against real COMPAS recidivism data shows COMPAS-derived flip rates exceed pooled synthetic rates, suggesting our benchmark provides a conservative estimate of real-world bias. Code and data are publicly available.
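The structured decomposition mitigation described above can be sketched as follows: the model's only job is to extract task-relevant fields, and a deterministic rubric makes the final call, so spurious features like titles never reach the decision. This is an illustrative sketch, not the paper's implementation; the rubric, field names, and regex-based "extraction" stand in for an actual LLM extraction step:

```python
import re

def rubric_decide(features: dict) -> str:
    """Hypothetical deterministic rubric over task-relevant fields only."""
    score = 0
    score += 2 if features["income"] >= 50_000 else 0
    score += 1 if features["pays_on_time"] else 0
    return "approve" if score >= 2 else "deny"

def extract_features(vignette: str) -> dict:
    """Stand-in for the LLM extraction step: pull out only the fields
    the rubric uses, discarding names, credentials, and framing."""
    income = int(re.search(r"income of \$(\d+)", vignette).group(1))
    return {"income": income, "pays_on_time": "on time" in vignette}

# The two vignettes differ only in an authority cue (Dr. vs Mr.).
a = "Dr. Smith, income of $60000, pays on time."
b = "Mr. Smith, income of $60000, pays on time."
print(rubric_decide(extract_features(a)))  # approve
print(rubric_decide(extract_features(b)))  # approve -- no flip
```

Because the rubric sees only the extracted fields, identical features guarantee identical verdicts, which is why this design can drive flip rates toward zero.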