How Alignment Routes: Localizing, Scaling, and Controlling Policy Circuits in Language Models
arXiv cs.CL / 4/7/2026
Key Points
- The paper reports a recurring sparse routing mechanism in alignment-trained language models, in which a “gate” attention head detects specific content and activates downstream “amplifier” heads that strengthen refusal behavior.
- Using political censorship and safety refusal as “natural experiments,” the authors trace this circuit across nine models from six labs and validate it on 120 prompt pairs, with necessity/sufficiency-style tests and robustness checks under resampling.
- Scaling experiments indicate the routing structure remains detectable and functional as models grow; routing signatures are preserved even when ablation effects are up to 17× weaker.
- By modulating the detection-layer signal, the authors demonstrate continuous control over policy strength, from hard refusal through steering to factual compliance, with topic-dependent routing thresholds (see the sketch after this list).
- The circuit analysis suggests a separation between intent recognition and policy routing: on cipher-encoded inputs, the routing contribution collapses and the model performs puzzle-solving instead of refusing, implying that pretraining knowledge and post-training policy binding have different robustness properties.
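
To make the intervention concrete, here is a minimal sketch of what gate-head modulation might look like using TransformerLens-style activation hooks. Everything specific below is an illustrative assumption, not the paper's setup: the stand-in model (`gpt2`), the `GATE_LAYER`/`GATE_HEAD` indices, and the prompt are hypothetical placeholders. The paper locates the actual gate head per model; this only shows the shape of the intervention, scaling one head's output by a factor `alpha` and reading off the downstream effect.

```python
import torch
from functools import partial
from transformer_lens import HookedTransformer

# Stand-in model; the paper studies nine alignment-trained models.
model = HookedTransformer.from_pretrained("gpt2")

# Hypothetical indices: replace with the gate head located for a given model.
GATE_LAYER, GATE_HEAD = 8, 3


def scale_gate_head(z, hook, alpha: float):
    """Scale one attention head's per-head output (hook_z) in place.

    z: [batch, pos, n_heads, d_head].
    alpha = 0 ablates the route; 0 < alpha < 1 weakens it; alpha > 1 amplifies.
    """
    z[:, :, GATE_HEAD, :] = alpha * z[:, :, GATE_HEAD, :]
    return z


def logits_with_gate_scaled(prompt: str, alpha: float) -> torch.Tensor:
    hook_name = f"blocks.{GATE_LAYER}.attn.hook_z"
    tokens = model.to_tokens(prompt)
    return model.run_with_hooks(
        tokens,
        fwd_hooks=[(hook_name, partial(scale_gate_head, alpha=alpha))],
    )


# Sweep alpha to trace the refusal-strength dial described above.
for alpha in (0.0, 0.5, 1.0, 2.0):
    logits = logits_with_gate_scaled("Example policy-sensitive prompt", alpha)
    next_token = model.tokenizer.decode(logits[0, -1].argmax().item())
    print(f"alpha={alpha}: next token {next_token!r}")
```

In this framing, `alpha = 0` approximates the necessity-style ablation, while sweeping `alpha` upward corresponds to the continuous refusal-strength control the authors report.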