DART: Mitigating Harm Drift in Difference-Aware LLMs via Distill-Audit-Repair Training
arXiv cs.CL / 4/21/2026
Key Points
- The paper studies why safety-tuned LLMs often become "identity-blind," giving incorrect answers, refusing unnecessarily, or defaulting to generic equal-treatment behavior even when recognizing group differences is both relevant and factually correct.
- It introduces a difference-awareness classification setup, distinguishing cases where answering correctly requires recognizing demographic differences versus cases where identical treatment is appropriate.
- The authors find that fine-tuning for higher accuracy can cause “harm drift,” where model explanations grow more harmful through elaboration, new problematic assumptions, or failure to flag harms identified by a baseline.
- To address this, they propose DART (Distill–Audit–Repair Training), which distills label-conditioned reasoning from a teacher model, audits the resulting explanations for harm drift relative to a baseline, and repairs flagged cases with severity-weighted fine-tuning (a minimal sketch follows this list).
- Across eight benchmarks and 280 real-world queries, DART substantially boosts accuracy and difference-appropriate responses while reducing harm-drift cases and greatly lowering refusals, suggesting accuracy and safety can be aligned with explicit detection/repair mechanisms.
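
The digest does not reproduce the paper's implementation, but the three stages are concrete enough to sketch. The Python below is a minimal, hypothetical rendering of a Distill–Audit–Repair loop under stated assumptions: `teacher.explain`, `judge.harm_drift`, and the `student`/`baseline` generation and training interfaces are invented placeholders, and multiplying the loss by an audit severity score is one plausible reading of "severity-weighted fine-tuning," not the paper's exact objective.

```python
# Hypothetical DART-style loop. All interfaces (teacher.explain,
# judge.harm_drift, student.nll/step) are illustrative assumptions,
# not the paper's actual API.
from dataclasses import dataclass
from typing import List

@dataclass
class Example:
    query: str
    label: str             # "difference-aware" vs. "identical-treatment"
    rationale: str = ""    # label-conditioned reasoning distilled from the teacher
    severity: float = 0.0  # harm-drift severity assigned by the audit step

def distill(teacher, dataset: List[Example]) -> List[Example]:
    """Stage 1: distill label-conditioned reasoning from a teacher."""
    for ex in dataset:
        # Conditioning on the gold label steers the teacher toward a
        # difference-aware or equal-treatment rationale as appropriate.
        ex.rationale = teacher.explain(ex.query, condition_on=ex.label)
    return dataset

def audit(judge, baseline, student, dataset: List[Example]) -> List[Example]:
    """Stage 2: flag explanations that drifted to be more harmful than baseline."""
    flagged = []
    for ex in dataset:
        base_out = baseline.generate(ex.query)
        student_out = student.generate(ex.query)
        # Harm is scored *relative* to the baseline: elaborated harms,
        # newly introduced problematic assumptions, or harms the
        # baseline flagged that the student no longer does.
        ex.severity = judge.harm_drift(base_out, student_out)
        if ex.severity > 0.0:
            flagged.append(ex)
    return flagged

def repair(student, flagged: List[Example], epochs: int = 1):
    """Stage 3: severity-weighted fine-tuning on the distilled rationales."""
    for _ in range(epochs):
        for ex in flagged:
            loss = student.nll(ex.query, ex.rationale)  # tensor-valued NLL
            # Up-weight high-severity cases so the worst drift
            # dominates the repair gradient.
            (ex.severity * loss).backward()
            student.step()
    return student
```

One design point worth noting from the key points: the audit is relative rather than absolute. Harm is measured as drift from a fixed baseline's output, which is what lets accuracy-oriented fine-tuning proceed while still catching explanations that get worse along the way.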