How Alignment Routes: Localizing, Scaling, and Controlling Policy Circuits in Language Models

arXiv cs.CL / 4/7/2026


Key Points

  • The paper reports a recurring sparse routing mechanism in alignment-trained language models where a gate attention head detects specific content and activates downstream amplifier heads to strengthen refusal behavior.
  • Using political censorship and safety refusal as “natural experiments,” the authors trace this circuit across nine models from six labs, validating it on 120 prompt pairs with necessity- and sufficiency-style interchange tests and checking robustness under bootstrap resampling.
  • Scaling experiments on three same-generation model pairs indicate the routing structure distributes as models grow: ablation effects become up to 17× weaker, yet routing remains detectable via interchange tests.
  • By modulating the detection-layer signal, the authors demonstrate continuous control over policy strength (from hard refusal to steering and factual compliance) with topic-dependent routing thresholds.
  • The circuit analysis suggests a separation between intent recognition and policy routing: when inputs are cipher-encoded, the gate head's routing contribution collapses and the model performs puzzle-solving rather than refusing, implying that broad semantic understanding from pretraining is more robust to input transformation than the narrower policy binding learned in post-training.
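
The necessity/sufficiency interchange tests in the second bullet can be illustrated with a toy stand-in for the circuit. Everything below — the function names, the unit signal values, and the linear gate-to-amplifier wiring — is a hypothetical sketch for intuition, not the paper's actual code or models.

```python
from typing import Optional

def gate_head(prompt_is_flagged: bool) -> float:
    """Detection signal: high when the gate head recognizes flagged content."""
    return 1.0 if prompt_is_flagged else 0.0

def amplifier_heads(gate_signal: float, n_heads: int = 4) -> float:
    """Downstream amplifier heads boost the routed signal toward refusal."""
    return sum(0.5 * gate_signal for _ in range(n_heads))

def refusal_logit(prompt_is_flagged: bool,
                  gate_override: Optional[float] = None) -> float:
    """Run the toy circuit, optionally patching the gate activation."""
    g = gate_head(prompt_is_flagged) if gate_override is None else gate_override
    return amplifier_heads(g)

# Necessity: ablating the gate (override -> 0) on a flagged prompt kills refusal.
full = refusal_logit(True)                       # normal flagged run
ablated = refusal_logit(True, gate_override=0.0)

# Sufficiency: patching the flagged-run gate activation into a benign run
# induces refusal even though the prompt itself is benign.
benign = refusal_logit(False)
patched = refusal_logit(False, gate_override=gate_head(True))

print(full, ablated, benign, patched)  # -> 2.0 0.0 0.0 2.0
```

In the real experiments these overrides correspond to activation patching on the gate head during a forward pass, with significance assessed against a permutation null; the toy only shows the logical shape of the two tests.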

Abstract

We identify a recurring sparse routing mechanism in alignment-trained language models: a gate attention head reads detected content and triggers downstream amplifier heads that boost the signal toward refusal. Using political censorship and safety refusal as natural experiments, we trace this mechanism across 9 models from 6 labs, each validated on a corpus of 120 prompt pairs. The gate head passes necessity and sufficiency interchange tests (p < 0.001, permutation null), and core amplifier heads are stable under bootstrap resampling (Jaccard 0.92–1.0). Three same-generation scaling pairs show that routing distributes at scale (ablation effects up to 17× weaker) while remaining detectable by interchange. By modulating the detection-layer signal, we continuously control policy strength from hard refusal through steering to factual compliance, with routing thresholds that vary by topic. The circuit also reveals a structural separation between intent recognition and policy routing: under cipher encoding, the gate head's routing contribution collapses (by 78% in Phi-4 at n = 120) and the model responds with puzzle-solving rather than refusal. The routing mechanism never fires, even though probe scores at deeper layers indicate the model begins to represent the harmful content. This asymmetry is consistent with different robustness properties of pretraining and post-training: broad semantic understanding versus narrower policy binding that generalizes less well under input transformation.
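
The continuous-control result — scaling the detection-layer signal to move from hard refusal through steering to factual compliance, with topic-dependent thresholds — can be sketched in miniature. The regime names, the unit gate signal, and the threshold values below are invented for illustration; the paper's actual intervention operates on model activations.

```python
def behavior(gate_scale: float, topic_threshold: float = 0.5) -> str:
    """Map a scaled detection signal to a coarse behavior regime.

    A unit gate signal is multiplied by the intervention scale; the regime
    boundaries (1x and 2x the topic threshold) are hypothetical.
    """
    strength = gate_scale * 1.0  # unit detection signal, modulated externally
    if strength >= 2 * topic_threshold:
        return "hard refusal"
    if strength >= topic_threshold:
        return "steering"
    return "factual compliance"

# Sweeping the scale moves the toy model continuously through regimes:
print([behavior(s) for s in (1.5, 0.7, 0.2)])
# -> ['hard refusal', 'steering', 'factual compliance']

# A topic with a higher routing threshold needs a stronger signal to refuse:
print(behavior(0.7, topic_threshold=0.8))  # -> factual compliance
```

The topic-dependent threshold is the point of the second call: the same intervention strength that produces steering on one topic falls below the routing threshold on another and yields factual compliance instead.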