Detection Is Cheap, Routing Is Learned: Why Refusal-Based Alignment Evaluation Fails
arXiv cs.LG · March 20, 2026
Key Points
- The paper argues that current alignment evaluations focused on detection or refusals miss the critical routing step from detection to policy, which largely determines model behavior.
- Probing results show that accuracy on political probes can reach 100% even for non-generalizable categories, so held-out generalization is the real diagnostic test.
- Surgical ablations reveal that the routing from political sensitivity to censorship is lab- and model-specific: removing the political-sensitivity direction restores factual outputs in many models, though some architectures entangle factual knowledge with the censorship behavior itself.
- Cross-model transfer of routing behavior fails, indicating that routing geometry is not portable across models or labs.
- Refusal-based benchmarks can miss censorship entirely, because some models shift from hard refusals to narrative steering. The authors therefore propose a three-stage framework (detect, route, generate) and urge evaluations to audit routing and generation rather than only detection or refusal.
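The "removing a direction" ablation mentioned above can be sketched as projecting activations onto the orthogonal complement of a probe-learned direction. The sketch below uses a planted direction in synthetic data as a stand-in; the paper's actual probes, models, and layer choices are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "political sensitivity" direction planted in toy hidden
# states (in practice this direction would come from a trained probe).
d_model = 64
direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)

# Synthetic hidden states with a strong component along that direction.
H = rng.normal(size=(100, d_model)) \
    + 3.0 * np.outer(rng.normal(size=100), direction)

def ablate_direction(h: np.ndarray, u: np.ndarray) -> np.ndarray:
    """Project out the unit direction u from each row of h."""
    u = u / np.linalg.norm(u)
    return h - np.outer(h @ u, u)

H_ablated = ablate_direction(H, direction)

# After ablation, no activation has any component along the direction.
print(bool(np.abs(H_ablated @ direction).max() < 1e-9))
```

Whether such an ablation restores factual outputs, as the paper reports for many models, depends on how entangled that direction is with the model's knowledge, which is exactly the model-specific routing question the authors raise.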