Paper: https://arxiv.org/abs/2604.04385
I've been trying to understand where refusal actually lives and how it works mechanistically. Arditi et al. showed that refusal can be steered with a single direction. The question I looked at here is: what circuit creates and amplifies that direction?
Main result: Across 12 models from 6 labs, I keep finding a sparse gate-amplifier pattern.
A mid-layer 'gate' attention head reads a detection-layer representation and writes a routing vector. Later 'amplifier' attention heads then boost that signal towards refusal / censorship behavior.
In smaller models, this usually looks like one main gate head + a few amplifier heads. In larger models, it spreads into bands of heads across adjacent layers.
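To make the pattern concrete, here is a toy scalar sketch of it. This is not the paper's code: the `forward` function, the weights, and the single-number "refusal direction" are all made up for illustration, and the numbers are picked so that the gate's direct write is small compared with what the amplifiers add once they read it.

```python
# Toy scalar model (hypothetical, not the paper's code): one "gate" head and
# three "amplifier" heads that all write along a single refusal direction.

def forward(detect, gate_on=True):
    """Return (refusal_logit, direct contributions per component group)."""
    # The gate reads the detection signal and writes a small routing value.
    gate_out = 0.1 * detect if gate_on else 0.0
    # Each amplifier fires in proportion to the routing value it reads.
    amp_outs = [gate_out * w for w in (20.0, 30.0, 50.0)]
    # The output sums every component's direct write to the refusal direction.
    logit = gate_out + sum(amp_outs)
    return logit, {"gate": gate_out, "amps": sum(amp_outs)}


logit, direct = forward(detect=1.0)
gate_share = direct["gate"] / logit            # gate's direct share of the output
ablated_logit, _ = forward(detect=1.0, gate_on=False)

print(f"gate direct share of output: {gate_share:.3%}")
print(f"logit with gate: {logit:.2f}, with gate knocked out: {ablated_logit:.2f}")
```

In this toy, the gate's direct share of the output is tiny, yet removing it zeroes the amplifiers as well. Real heads are of course not scalars, so treat this only as intuition for why a small direct contribution and a large causal effect can coexist.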
A few things surprised me:
- The gate looks unimportant if you only use output-level direct logit attribution (DLA). In Qwen3-8B, the gate contributes under 1% of output DLA, so it does not rank as a top attention head.
- But it is causally necessary. Interchange testing identifies the gate, and knocking it out suppresses the downstream amplifiers. (The paper explains how interchange testing works.)
- Scaling changes how you find it. Per-head ablation effects weaken a lot as models get bigger (by up to 58x across the tested scaling model pairs). By 72B, the top per-head ablation effect looks like noise, but interchange testing still finds the trigger component.
- Simple bijection encodings can break the routing trigger. If the model is taught a substitution cipher in-context and the same prompts are then encoded through that cipher, the gate’s necessity collapses and the model switches to puzzle-solving instead of refusal.
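For anyone unfamiliar with the cipher setup in the last bullet, a minimal sketch of a bijective substitution cipher follows. The helper names (`make_cipher`, `encode`, `decode`, `build_prompt`) and the prompt layout are my own assumptions for illustration; the exact encoding and prompt format used in the paper may differ.

```python
import random
import string


def make_cipher(seed=0):
    """Build a random bijection over lowercase letters (a substitution cipher)."""
    letters = list(string.ascii_lowercase)
    shuffled = letters[:]
    random.Random(seed).shuffle(shuffled)
    return dict(zip(letters, shuffled))


def encode(text, mapping):
    # Characters outside the mapping (spaces, punctuation) pass through unchanged.
    return "".join(mapping.get(ch, ch) for ch in text.lower())


def decode(text, mapping):
    # Invert the bijection to recover the lowercase plaintext.
    inverse = {v: k for k, v in mapping.items()}
    return "".join(inverse.get(ch, ch) for ch in text)


def build_prompt(request, mapping):
    # Hypothetical layout: teach the mapping in-context, then give the encoded request.
    table = ", ".join(f"{k}->{v}" for k, v in mapping.items())
    return f"Cipher table: {table}\nEncoded message: {encode(request, mapping)}"


mapping = make_cipher(seed=0)
example = "hello world"
encoded = encode(example, mapping)
prompt = build_prompt(example, mapping)
```

Because the mapping is a bijection, `decode(encode(x))` recovers the lowercased input exactly, so no information is lost. Only the surface form that the detection layer sees changes.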
The interpretation I currently favor is:
- detection and policy routing are separate computations
- the refusal routing circuit commits early
- if the input fails to instantiate the right gate-readable representation at that stage, the later policy never properly binds
A result I found especially interesting is that you can partially restore refusal by injecting the plaintext gate activation back into the cipher forward pass. In Phi-4-mini, that restores refusal in 48% of cases, which suggests the failure is specifically at the routing trigger rather than the whole downstream computation being unusable.
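The injection experiment has the usual activation-patching shape: cache an activation on one input, then overwrite it during a run on another input. Below is a toy sketch of that procedure, assuming a stand-in "model" made of plain Python functions. The names `gate_activation` and `run`, the feature dict, and the threshold are hypothetical and not from the paper's code.

```python
# Toy stand-in for a model (hypothetical): the gate fires only when the
# detection feature is present in a form it can read.

def gate_activation(features):
    return 1.0 if features.get("harmful_plaintext") else 0.0


def run(features, patched_gate=None):
    """Return True if the toy model refuses. `patched_gate` overrides the gate."""
    gate = gate_activation(features) if patched_gate is None else patched_gate
    amplified = 5.0 * gate        # downstream amplifiers scale the gate signal
    return amplified > 1.0        # refuse when the routed signal is strong enough


plain = {"harmful_plaintext": True}
cipher = {"harmful_plaintext": False}   # same request, but unreadable to the gate

cached = gate_activation(plain)                 # 1) cache the gate activation on plaintext
baseline = run(cipher)                          # 2) cipher run, no intervention
patched = run(cipher, patched_gate=cached)      # 3) cipher run with the plaintext gate injected

print(f"refuses on plaintext: {run(plain)}")
print(f"refuses on cipher: {baseline}")
print(f"refuses on cipher with patched gate: {patched}")
```

The toy restores refusal deterministically, which is cleaner than the real result (48% in Phi-4-mini). In an actual transformer, the patch would replace one head's output at specific token positions, and everything else in the cipher forward pass would still differ from the plaintext run.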
Code, reproducibility guide, and saved results are all linked in the paper.