When Names Change Verdicts: Intervention Consistency Reveals Systematic Bias in LLM Decision-Making
arXiv cs.CL / 3/20/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- We introduce ICE-Guard, a framework that applies intervention consistency testing to detect three types of spurious feature reliance in LLMs across 3,000 vignettes in 10 high-stakes domains, evaluating 11 LLMs from 8 families.
- The study identifies three bias types—demographic (name/race swaps), authority (credential/prestige swaps), and framing (positive/negative restatements)—and finds authority bias (mean 5.8%) and framing bias (5.0%) substantially exceed demographic bias (2.2%).
- Bias concentration varies by domain, with finance showing 22.6% authority bias and criminal justice showing only 2.8%.
- A structured decomposition approach, where the LLM extracts features and a deterministic rubric makes the final decision, reduces flip rates by up to 100% (median 49% across 9 models).
- The ICE-guided detect-diagnose-mitigate-verify loop achieves about 78% bias reduction via iterative prompt patching, and validation against COMPAS recidivism data suggests the benchmark provides a conservative estimate of real-world bias; code and data are publicly available.
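The core measurement behind these numbers is intervention consistency: apply a minimal, decision-irrelevant edit to a vignette (e.g., swap the applicant's name), re-query the model, and count how often the verdict flips. A minimal sketch of that flip-rate computation, with hypothetical helper names (`decide`, `intervene`) and a toy rule-based model standing in for an LLM — the paper's actual ICE-Guard implementation is not reproduced here:

```python
def flip_rate(vignettes, decide, intervene):
    """Fraction of vignettes whose verdict changes under the intervention.

    decide(text)    -> a discrete verdict (e.g., "approve" / "deny")
    intervene(text) -> the counterfactual vignette (e.g., name swapped)
    """
    flips = sum(decide(v) != decide(intervene(v)) for v in vignettes)
    return flips / len(vignettes)


if __name__ == "__main__":
    # Toy stand-in for an LLM: a biased rule that denies whenever
    # the applicant is named "Jamal".
    def toy_decide(text):
        return "deny" if "Jamal" in text else "approve"

    # Demographic intervention: swap one name for another.
    def swap_name(text):
        return text.replace("Greg", "Jamal")

    vignettes = [
        "Greg applies for a loan with a 720 credit score.",
        "Greg requests parole after three years with a clean record.",
    ]
    print(flip_rate(vignettes, toy_decide, swap_name))  # 1.0: every verdict flips
```

An unbiased model yields a flip rate of 0; the paper's per-bias-type means (5.8% authority, 5.0% framing, 2.2% demographic) are aggregates of exactly this kind of statistic across vignettes and models.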