Over-Refusal and Representation Subspaces: A Mechanistic Analysis of Task-Conditioned Refusal in Aligned LLMs
arXiv cs.CL · March 31, 2026
Key Points
- The paper investigates why aligned LLMs can show “over-refusal,” incorrectly declining harmless instructions that are similar in form to harmful ones.
- It argues that ablating a single global refusal direction improves over-refusal only incidentally: the intervention suppresses the broader, task-level refusal mechanism wholesale rather than targeting over-refusal specifically (a minimal sketch of this ablation follows the list).
- Mechanistic analysis suggests harmful-refusal directions are largely task-agnostic and well-approximated by a single global vector, while over-refusal directions are task-dependent and occupy a higher-dimensional subspace (the spectrum check sketched below illustrates the contrast).
- Linear probing results indicate the two refusal behaviors are representationally distinct and emerge in later transformer layers rather than being captured early (see the probing sketch below).
- The authors conclude that correcting over-refusal likely requires task-specific geometric interventions rather than one-size-fits-all direction removal.
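
As a concrete picture of the global intervention the paper critiques, here is a minimal sketch of single-direction ablation over residual-stream activations. The difference-of-means estimate of the direction and the function names are illustrative assumptions from prior direction-ablation work, not necessarily the paper's exact procedure.

```python
import torch

def refusal_direction(harmful_acts: torch.Tensor,
                      harmless_acts: torch.Tensor) -> torch.Tensor:
    """Estimate a single global 'refusal direction' as a difference of mean
    activations (an assumed, commonly used choice), returned unit-norm.
    Both inputs: (n_prompts, d_model) cached activations at one layer."""
    v = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return v / v.norm()

def ablate(hidden: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Project the component along v out of every activation vector.
    hidden: (..., d_model); v: (d_model,) unit-norm direction."""
    return hidden - (hidden @ v).unsqueeze(-1) * v
```

Because the same vector v is removed everywhere, the edit cannot distinguish a prompt that should be refused from one that merely resembles it, which is the failure mode the paper attributes to this approach.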
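The dimensionality claim can be checked in spirit with a simple spectrum test: stack per-task refusal directions and count how many principal components are needed to explain most of the variance. A near-rank-1 spectrum supports a single global vector; a flatter spectrum indicates a higher-dimensional subspace. A hedged sketch, assuming per-task directions have already been estimated:

```python
import torch

def effective_rank(directions: torch.Tensor, var_threshold: float = 0.9) -> int:
    """Number of principal components needed to explain `var_threshold` of the
    variance across a (n_tasks, d_model) stack of direction vectors."""
    centered = directions - directions.mean(dim=0, keepdim=True)
    s = torch.linalg.svdvals(centered)     # singular values, descending
    var = s**2 / (s**2).sum()              # variance explained per component
    return int((var.cumsum(0) < var_threshold).sum().item() + 1)
```

On this reading, harmful-refusal directions would yield an effective rank near 1, while over-refusal directions would yield a noticeably larger value.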
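Finally, the probing result can be reproduced in outline with a per-layer linear classifier over cached last-token activations; accuracy rising only in later layers would match the reported finding. The data layout and labeling scheme here are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def probe_accuracy(layer_acts: np.ndarray, labels: np.ndarray) -> float:
    """5-fold CV accuracy of a linear probe separating harmful-refusal (1)
    from over-refusal (0) prompts, given (n_prompts, d_model) activations."""
    clf = LogisticRegression(max_iter=2000)
    return cross_val_score(clf, layer_acts, labels, cv=5).mean()

# Sweeping probe_accuracy across layers should show the two refusal
# behaviors becoming linearly separable only in later layers.
```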