AI Navigate

Detection Is Cheap, Routing Is Learned: Why Refusal-Based Alignment Evaluation Fails

arXiv cs.LG / 3/20/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper argues that current alignment evaluations focused on detection or refusals miss the critical routing step from detection to policy, which largely determines model behavior.
  • Probing results show that probe accuracy alone is non-diagnostic: political probes, null controls, and permutation baselines can all reach 100%, so generalization to held-out categories is the real diagnostic test.
  • Surgical ablations reveal that routing from political sensitivity to censorship is lab- and model-specific: removing the political-sensitivity direction restores accurate factual outputs in most models, though some architectures entangle factual knowledge with the censorship mechanism.
  • Cross-model transfer of routing behavior fails, indicating that routing geometry is not portable across models or labs.
  • Refusal-based benchmarks can miss censorship entirely, as some models shift from hard refusals to narrative steering. The authors propose a three-stage framework (detect, route, generate) and urge evaluations to audit routing and generation rather than only detection or refusal.

Abstract

Current alignment evaluation mostly measures whether models encode dangerous concepts and whether they refuse harmful requests. Both miss the layer where alignment often operates: routing from concept detection to behavioral policy. We study political censorship in Chinese-origin language models as a natural experiment, using probes, surgical ablations, and behavioral tests across nine open-weight models from five labs. Three findings follow. First, probe accuracy alone is non-diagnostic: political probes, null controls, and permutation baselines can all reach 100%, so held-out category generalization is the informative test. Second, surgical ablation reveals lab-specific routing. Removing the political-sensitivity direction eliminates censorship and restores accurate factual output in most models tested, while one model confabulates because its architecture entangles factual knowledge with the censorship mechanism. Cross-model transfer fails, indicating that routing geometry is model- and lab-specific. Third, refusal is no longer the dominant censorship mechanism. Within one model family, hard refusal falls to zero while narrative steering rises to the maximum, making censorship invisible to refusal-only benchmarks. These results support a three-stage descriptive framework: detect, route, generate. Models often retain the relevant knowledge; alignment changes how that knowledge is expressed. Evaluations that audit only detection or refusal therefore miss the routing mechanism that most directly determines behavior.
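The "surgical ablation" the abstract mentions is, at its core, a projection: remove the component of each hidden state that lies along an identified direction, leaving everything orthogonal untouched. The sketch below is a minimal, assumption-laden illustration with synthetic vectors (the direction, dimensions, and scale are all made up), not the authors' pipeline.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64  # hypothetical activation dimensionality

# Hypothetical "political-sensitivity" direction (unit vector) in activation space.
v = rng.normal(size=d)
v /= np.linalg.norm(v)

def ablate(h, v):
    """Remove the component of each row of h along unit vector v: h - (h.v) v."""
    return h - np.outer(h @ v, v)

# Synthetic activations carrying a signal along v.
h = rng.normal(size=(10, d)) + 2.5 * v
h_ablated = ablate(h, v)

# After ablation, no component along v remains...
print("max |h_ablated . v| =", np.abs(h_ablated @ v).max())  # ~0
# ...while projections onto any orthogonal direction are unchanged,
# which is why factual content can survive when it is not entangled with v.
```

The paper's finding that one model confabulates after ablation corresponds to the case where this orthogonality assumption fails: factual knowledge is not disjoint from the censorship direction, so projecting it out damages both.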