AI Navigate

Detection Is Cheap, Routing Is Learned: Why Refusal-Based Alignment Evaluation Fails

arXiv cs.LG / 3/20/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper argues that current alignment evaluations focused on detection or refusals miss the critical routing step from detection to policy, which largely determines model behavior.
  • Probing results show that probe accuracy alone is non-diagnostic: political probes, null controls, and permutation baselines can all reach 100%, so generalization to held-out categories is the real diagnostic test.
  • Surgical ablations reveal that routing from political sensitivity to censorship is lab- and model-specific: removing the political-sensitivity direction restores accurate factual outputs in most models, though some architectures entangle factual knowledge with the censorship mechanism.
  • Cross-model transfer of routing behavior fails, indicating that routing geometry is not portable across models or labs.
  • Refusal-based benchmarks can miss censorship entirely, as some models shift from hard refusals to narrative steering. The authors propose a three-stage framework (detect, route, generate) and urge evaluations to audit routing and generation rather than only detection or refusal.

Abstract

Current alignment evaluation mostly measures whether models encode dangerous concepts and whether they refuse harmful requests. Both miss the layer where alignment often operates: routing from concept detection to behavioral policy. We study political censorship in Chinese-origin language models as a natural experiment, using probes, surgical ablations, and behavioral tests across nine open-weight models from five labs. Three findings follow. First, probe accuracy alone is non-diagnostic: political probes, null controls, and permutation baselines can all reach 100%, so held-out category generalization is the informative test. Second, surgical ablation reveals lab-specific routing. Removing the political-sensitivity direction eliminates censorship and restores accurate factual output in most models tested, while one model confabulates because its architecture entangles factual knowledge with the censorship mechanism. Cross-model transfer fails, indicating that routing geometry is model- and lab-specific. Third, refusal is no longer the dominant censorship mechanism. Within one model family, hard refusal falls to zero while narrative steering rises to the maximum, making censorship invisible to refusal-only benchmarks. These results support a three-stage descriptive framework: detect, route, generate. Models often retain the relevant knowledge; alignment changes how that knowledge is expressed. Evaluations that audit only detection or refusal therefore miss the routing mechanism that most directly determines behavior.
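The "surgical ablation" the abstract mentions is, at its core, a projection: remove the component of each hidden state that lies along an identified direction, leaving everything orthogonal untouched. The sketch below is a minimal, assumption-laden illustration with synthetic vectors (the direction, dimensions, and scale are all made up), not the authors' pipeline.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64  # hypothetical activation dimensionality

# Hypothetical "political-sensitivity" direction (unit vector) in activation space.
v = rng.normal(size=d)
v /= np.linalg.norm(v)

def ablate(h, v):
    """Remove the component of each row of h along unit vector v: h - (h.v) v."""
    return h - np.outer(h @ v, v)

# Synthetic activations carrying a signal along v.
h = rng.normal(size=(10, d)) + 2.5 * v
h_ablated = ablate(h, v)

# After ablation, no component along v remains...
print("max |h_ablated . v| =", np.abs(h_ablated @ v).max())  # ~0
# ...while projections onto any orthogonal direction are unchanged,
# which is why factual content can survive when it is not entangled with v.
```

The paper's finding that one model confabulates after ablation corresponds to the case where this orthogonality assumption fails: factual knowledge is not disjoint from the censorship direction, so projecting it out damages both.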