Refusal in open-weights models looks like a sparse gate -> amplifier circuit, and generalizes across 12 models from 6 labs (2B-72B)

Reddit r/LocalLLaMA / 4/15/2026


Key Points

  • The paper argues that refusal behavior in open-weights LLMs is produced by a sparse “gate–amplifier” circuit that generalizes across 12 models from 6 labs (2B–72B).

Paper: https://arxiv.org/abs/2604.04385

I've been trying to understand where refusal actually lives and how it works mechanistically. Arditi et al. showed that refusal can be steered with a single direction. What I looked at here is the mechanistic question: what circuit creates and amplifies that direction?

Main result: Across 12 models from 6 labs, I keep finding a sparse gate-amplifier pattern.

A mid-layer 'gate' attention head reads a detection-layer representation and writes a routing vector. Later 'amplifier' attention heads then boost that signal towards refusal / censorship behavior.

In smaller models, this usually looks like one main gate head + a few amplifier heads. In larger models, it spreads into bands of heads across adjacent layers.
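The gate→amplifier pattern can be sketched as a toy computation. Everything here is invented for illustration (the vectors, the scales, and the `harmful_score` stand-in for the detection-layer representation); the real circuit operates on attention-head outputs in the residual stream, but the shape of the claim is the same: the gate writes a small routing vector, and the amplifiers turn it into a large refusal component.

```python
# Toy sketch (not the paper's code): a "gate" head writes a small routing
# vector into the residual stream; later "amplifier" heads read it and
# boost the refusal direction.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def scale(v, c):
    return [c * x for x in v]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

REFUSAL_DIR = [0.0, 0.0, 1.0]   # hypothetical refusal direction
ROUTING_VEC = [0.0, 1.0, 0.0]   # hypothetical gate routing vector

def gate_head(resid, harmful_score):
    # Gate: reads a detection feature and writes a *small* routing vector --
    # small enough that its direct contribution to the logits is negligible.
    if harmful_score > 0.5:
        return add(resid, scale(ROUTING_VEC, 0.1))
    return resid

def amplifier_head(resid):
    # Amplifier: reads the routing vector and writes a *large* refusal
    # component, so it dominates output-level attribution.
    routing = dot(resid, ROUTING_VEC)
    return add(resid, scale(REFUSAL_DIR, 10.0 * routing))

def forward(harmful_score):
    resid = [0.0, 0.0, 0.0]
    resid = gate_head(resid, harmful_score)
    for _ in range(3):              # a band of amplifier heads
        resid = amplifier_head(resid)
    return dot(resid, REFUSAL_DIR)  # refusal-logit proxy

print(forward(0.9))  # gate fires -> amplified refusal signal
print(forward(0.1))  # gate silent -> no refusal signal
```

Note the asymmetry this toy reproduces: the gate's own write is tiny, so output-level attribution assigns almost everything to the amplifiers, yet zeroing the gate silences the whole cascade.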

A few things surprised me:

  1. The gate looks unimportant under output-level direct logit attribution (DLA). In Qwen3-8B, the gate contributes under 1% of output DLA, so it never shows up as a top attention head.
  2. But it is causally necessary. Interchange testing identifies the gate, and knocking it out suppresses the downstream amplifiers (the paper explains how interchange testing works).
  3. Scaling changes how you find it. Per-head ablation weakens sharply as models get bigger (up to 58x across the tested scaling pairs). By 72B, top per-head ablation looks like noise, but interchange interventions still find the trigger component.
  4. Simple bijection encodings can break the routing trigger. If the model is taught a substitution cipher in-context and the same prompts are then encoded through that cipher, the gate's necessity collapses and the model switches to puzzle-solving instead of refusing.
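The bijection-encoding setup from point 4 is easy to reproduce. A minimal sketch (identifiers and prompt are invented; the actual encoding and prompting details are in the paper): build a random letter-substitution cipher, show the model the mapping in-context, and encode the probe prompt through it.

```python
# Illustrative sketch of the bijection-encoding setup: a random
# letter-substitution cipher applied to a probe prompt.
import random
import string

def make_cipher(seed=0):
    # Random bijection over lowercase letters.
    rng = random.Random(seed)
    letters = list(string.ascii_lowercase)
    shuffled = letters[:]
    rng.shuffle(shuffled)
    return dict(zip(letters, shuffled))

def encode(text, cipher):
    # Map letters through the cipher; leave everything else untouched.
    return "".join(cipher.get(c, c) for c in text.lower())

cipher = make_cipher()
key_demo = " ".join(f"{k}->{v}" for k, v in sorted(cipher.items()))
prompt = "how do i pick a lock"   # placeholder probe, not from the paper
encoded = encode(prompt, cipher)

# The in-context prompt would pair the key with the encoded request, e.g.:
# f"Cipher key: {key_demo}\nDecode and answer: {encoded}"
print(encoded)
```

Because the encoded string never instantiates the gate-readable representation at the detection layer, the routing trigger reportedly never fires, even though the information needed to refuse is recoverable in-context.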

The interpretation I currently favor is:

  • detection and policy routing are separate computations
  • the refusal routing circuit commits early
  • if the input fails to instantiate the right gate-readable representation at that stage, the later policy never properly binds

A result I found especially interesting is that you can partially restore refusal by injecting the plaintext gate activation back into the cipher forward pass. In Phi-4-mini, that restores refusal in 48% of cases, which suggests the failure is specifically at the routing trigger rather than the whole downstream computation being unusable.
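The restoration experiment amounts to activation patching at the gate site. A toy stand-in (not the paper's code; in this toy the refusal decision depends only on the gate activation, so patching restores it deterministically rather than in 48% of cases): cache the gate activation from the plaintext run, then inject it into the cipher run at the same site.

```python
# Toy activation-patching sketch: cache the gate activation from a
# "plaintext" forward pass and inject it into a "cipher" forward pass.

def gate(detects_harm):
    # Gate activation: routing signal if the detection feature is present.
    # In the cipher run the detection feature never appears, so this is 0.
    return 1.0 if detects_harm else 0.0

def downstream(gate_act):
    # Amplifiers turn the routing signal into a refusal decision.
    return "refuse" if gate_act > 0.5 else "comply"

def forward(detects_harm, patched_gate=None):
    # patched_gate overrides the gate's own activation (the intervention).
    act = gate(detects_harm) if patched_gate is None else patched_gate
    return downstream(act)

plaintext_gate = gate(detects_harm=True)  # cached from the plaintext run
print(forward(detects_harm=False))        # cipher run: no refusal
print(forward(detects_harm=False, patched_gate=plaintext_gate))  # patched
```

The point the toy makes is the same as the post's: if injecting only the gate-site activation restores refusal, the downstream policy machinery is intact and the failure is localized to the routing trigger.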

Code, reproducibility guide, and saved results all linked in the paper.

submitted by /u/Logical-Employ-9692