DART: Mitigating Harm Drift in Difference-Aware LLMs via Distill-Audit-Repair Training

arXiv cs.CL / April 21, 2026


Key Points

  • The paper studies why safety-tuned LLMs often become “identity-blind,” leading to incorrect answers, unnecessary refusals, or generic equal-treatment behavior even when group differences are relevant and factually correct.
  • It introduces a difference-awareness classification setup, distinguishing cases where answering correctly requires recognizing demographic differences versus cases where identical treatment is appropriate.
  • The authors find that fine-tuning for higher accuracy can cause “harm drift,” where model explanations grow more harmful through elaboration, new problematic assumptions, or failure to flag harms identified by a baseline.
  • To address this, they propose DART (Distill–Audit–Repair Training), which distills label-conditioned reasoning from a teacher, audits for harm drift relative to a baseline, and repairs issues using severity-weighted fine-tuning.
  • Across eight benchmarks and 280 real-world queries, DART substantially boosts accuracy and difference-appropriate responses while reducing harm-drift cases and greatly lowering refusals, suggesting accuracy and safety can be aligned with explicit detection/repair mechanisms.
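The audit and repair steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the helper names (`audit`, `repair_weights`), the toy harm scorer, and the [0, 3] severity scale are all hypothetical, and the real system would use an LLM-based auditor and gradient-level loss weighting rather than sampling weights.

```python
# Minimal sketch of DART's audit-and-repair idea (hypothetical helpers;
# the paper's actual auditor, severity scale, and loss are not shown here).

def audit(baseline_expl, tuned_expl, harm_score):
    """Flag harm drift: the fine-tuned model's explanation is judged more
    harmful than the baseline's. `harm_score` is an assumed scorer
    returning a nonnegative severity (0 = no harm detected)."""
    drift = harm_score(tuned_expl) - harm_score(baseline_expl)
    return max(drift, 0)  # positive values indicate drift vs. baseline

def repair_weights(severities):
    """Severity-weighted weights for repair fine-tuning: drift cases
    with more severe harm receive proportionally more training weight."""
    total = sum(severities) or 1
    return [s / total for s in severities]

# Toy harm scorer purely for illustration: counts flagged phrases.
FLAGS = ("inherently", "all members of")
toy_score = lambda text: sum(text.count(f) for f in FLAGS)

cases = [
    # (baseline explanation, fine-tuned explanation)
    ("group incidence rates differ in the data",
     "all members of the group are inherently at risk"),  # drifted
    ("treat applicants identically",
     "treat applicants identically"),                     # no drift
]
severities = [audit(b, t, toy_score) for b, t in cases]
print(severities)                  # [2, 0]
print(repair_weights(severities))  # [1.0, 0.0]
```

The key design point the paper argues for is that the repair signal is computed *relative to a baseline*, so the pipeline targets only explanations that became more harmful during accuracy-oriented fine-tuning, rather than penalizing all harm uniformly.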

Abstract

Large language models (LLMs) tuned for safety often avoid acknowledging demographic differences, even when such acknowledgment is factually correct (e.g., ancestry-based disease incidence) or contextually justified (e.g., religious hiring preferences). This identity-blindness yields incorrect responses, unnecessary refusals, or generic "equal-treatment" defaults. We study this via difference-awareness classification: given a question involving demographic groups, the task is not to answer directly, but to classify whether a correct answer requires recognizing group differences (yes) or whether groups should be treated identically (no). Crucially, fine-tuning for accuracy triggers harm drift: model-generated explanations become increasingly harmful as decision accuracy improves, whether by elaborating harmful content, introducing problematic assumptions, or failing to flag harms the baseline identified. To mitigate this, we introduce DART (Distill–Audit–Repair Training), which distills label-conditioned reasoning from a teacher, audits outputs for harm drift cases relative to baseline, and repairs problematic cases via severity-weighted fine-tuning. On eight benchmarks, DART improves Llama-3-8B-Instruct accuracy from 39.0% to 68.8%, with the largest gains on equal-treatment prompts (11.3% -> 72.6%), while reducing harm drift cases by 72.6%. It also transfers to 280 open-ended real-world queries across medical, legal, policy, and educational domains, improving difference-appropriate responses from 39.8% to 77.5% while reducing refusals from 34.3% to 3.0%. Our results demonstrate that accuracy and safety need not conflict when explicit detection and repair mechanisms are in place.