Detection Is Cheap, Routing Is Learned: Why Refusal-Based Alignment Evaluation Fails
arXiv cs.LG / 3/20/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The paper argues that alignment evaluations focused on detection or refusals miss the routing step from detection to policy, which largely determines model behavior.
- Probing results show that accuracy on political probes can reach 100% even for categories that do not generalize, so held-out generalization, not in-distribution accuracy, is the real diagnostic.
- Surgical ablations reveal that the routing from political sensitivity to censorship is lab- and model-specific: removing the political-sensitivity direction restores factual outputs in many models, though some architectures entangle factual knowledge with the censorship behavior.
- Cross-model transfer of routing behavior fails, indicating that routing geometry is not portable across models or labs.
- Refusal-based benchmarks can miss censorship entirely, since some models shift from hard refusals to narrative steering; the authors propose a three-stage framework (detect, route, generate) and urge evaluations to audit routing and generation rather than detection or refusal alone.
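The "removing the political-sensitivity direction" step above can be sketched as a standard directional ablation: project a hidden activation onto a unit direction and subtract that component. This is a minimal illustration, not the paper's implementation; the direction `d` would in practice come from a trained probe, and all names here are hypothetical.

```python
import numpy as np

def ablate_direction(activations: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove a single (hypothetical) 'sensitivity' direction from a batch
    of hidden activations: h' = h - (h . d_hat) d_hat."""
    d_hat = direction / np.linalg.norm(direction)
    return activations - np.outer(activations @ d_hat, d_hat)

# Toy check: after ablation, activations carry no component along the direction.
rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))   # 4 token positions, hidden size 8
d = rng.normal(size=8)        # stand-in for a probe-derived direction
h_abl = ablate_direction(h, d)
print(np.allclose(h_abl @ d, 0))  # → True
```

Whether such an ablation restores factual outputs, or instead degrades them because knowledge and censorship share the same subspace, is exactly the model-specific distinction the paper reports.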
Related Articles
I Was Wrong About AI Coding Assistants. Here's What Changed My Mind (and What I Built About It).
Dev.to

Interesting loop
Reddit r/LocalLLaMA
Qwen3.5-122B-A10B Uncensored (Aggressive) — GGUF Release + new K_P Quants
Reddit r/LocalLLaMA
A supervisor or "manager" AI agent is the wrong way to control AI
Reddit r/artificial
FeatherOps: Fast fp8 matmul on RDNA3 without native fp8
Reddit r/LocalLLaMA