Moral Sensitivity in LLMs: A Tiered Evaluation of Contextual Bias via Behavioral Profiling and Mechanistic Interpretability
arXiv cs.LG / 5/6/2026
Key Points
- The paper argues that current LLM bias evaluations are too binary, and instead proposes a seven-tier, context-sensitive stress test to capture how bias emerges gradually.
- It introduces the Moral Sensitivity Index (MSI) to quantify the probability of biased outputs, and reports different behavioral signatures across Claude 3.5, Qwen 3.5, Llama 3, and Gemini 1.5 under abstract and socially/institutionally loaded scenario framing.
- In particular, Gemini 1.5 reaches 72.7% MSI by Tier 5 in socioeconomic injustice contexts, while Claude shows sharp suppression consistent with identity-based safety training effects.
- The authors then validate these behavioral patterns mechanistically with logit-lens analysis, attention analysis, activation patching, and semantic probing. They report a "U-curve": small base models show strong criminal-stereotype bias, instruction tuning suppresses it, and reasoning distillation reintroduces it.
- The study claims that the same socially loaded cues that raise MSI scores also activate the bias-driving circuits identified in the mechanistic analysis, providing cross-stage validation of the behavioral results.
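The paper's exact definition of the Moral Sensitivity Index is not reproduced in this summary. As a rough illustration only, assuming MSI is the empirical fraction of model completions judged biased at each tier (the tier labels, the `trials` structure, and the bias judgment are all hypothetical here):

```python
from collections import defaultdict

def moral_sensitivity_index(trials):
    """Illustrative MSI-style score: fraction of biased outputs per tier.

    `trials` is a list of (tier, biased) pairs, where `biased` is a bool
    produced by some external bias judge (not shown). This is a guessed
    reading of the metric, not the paper's actual formula.
    """
    counts = defaultdict(lambda: [0, 0])  # tier -> [biased_count, total]
    for tier, biased in trials:
        counts[tier][0] += int(biased)
        counts[tier][1] += 1
    return {tier: b / n for tier, (b, n) in counts.items()}

# Hypothetical Tier-5 run: 8 biased completions out of 11 trials,
# which would yield an MSI of about 0.727 (72.7%).
trials = [(5, True)] * 8 + [(5, False)] * 3
print(moral_sensitivity_index(trials))
```

Under this reading, the reported 72.7% figure for Gemini 1.5 at Tier 5 would simply be a per-tier biased-output rate; the paper may well use a more elaborate weighting.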