Refusal in open-weights models looks like a sparse gate -> amplifier circuit, and generalizes across 12 models from 6 labs (2B-72B)

Reddit r/LocalLLaMA / 4/15/2026

Key Points

The paper argues that refusal behavior in open-weights LLMs is produced by a sparse “gate–amplifier” circuit that generalizes across 12 models from 6 labs (2B–72B).

I've been trying to understand where refusal actually lives and how it works mechanistically. Arditi et al. showed that refusal can be steered with a single direction. What I looked at here is the next question: what circuit creates and amplifies that direction?
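For readers who have not seen the single-direction result, here is a minimal sketch of the idea as I understand it: take the difference of mean activations between harmful and harmless prompts, normalize it, then either project it out (ablation) or add it in (steering). The activations below are made-up 3-d toy vectors and the helper names are mine, not from any library or from the paper.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def refusal_direction(harmful_acts, harmless_acts):
    """Difference-of-means direction, normalized to unit length."""
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    return unit([h - b for h, b in zip(mu_harmful, mu_harmless)])

def ablate(act, direction):
    """Remove the component of `act` along `direction`."""
    coeff = dot(act, direction)
    return [a - coeff * d for a, d in zip(act, direction)]

def steer(act, direction, alpha):
    """Add `alpha` units of `direction` to `act`."""
    return [a + alpha * d for a, d in zip(act, direction)]

# Toy activations (assumption: real ones would come from a residual stream).
harmful = [[2.0, 0.1, 0.0], [1.8, -0.1, 0.2]]
harmless = [[0.0, 0.1, 0.0], [0.2, -0.1, 0.2]]

r_hat = refusal_direction(harmful, harmless)
ablated = ablate(harmful[0], r_hat)
steered = steer(harmless[0], r_hat, alpha=3.0)
```

In this toy, `ablated` has no component left along `r_hat`, and `steered` has a larger one than the harmless activation it started from.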

Main result: Across 12 models from 6 labs, I keep finding a sparse gate-amplifier pattern.

A mid-layer 'gate' attention head reads a detection-layer representation and writes a routing vector. Later 'amplifier' attention heads then boost that signal towards refusal / censorship behavior.

In smaller models, this usually looks like one main gate head + a few amplifier heads. In larger models, it spreads into bands of heads across adjacent layers.
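To make the pattern concrete, here is a hand-built toy, not one of the actual models: a "gate" writes mostly into a routing axis, and "amplifiers" read that axis and write into a refusal axis. The axes, gains, and the 0.01 leak are all invented for illustration. It shows how a component can contribute almost nothing directly to the output while everything downstream depends on it.

```python
# Hypothetical residual-stream axes for the toy.
DETECT, ROUTE, REFUSE = 0, 1, 2
AMP_GAINS = [2.0, 1.5, 1.0]  # made-up amplifier strengths

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def run(detect_signal, ablate_gate=False):
    """Toy forward pass: detection -> gate -> amplifiers -> refusal logit."""
    resid = [detect_signal, 0.0, 0.0]

    # Gate: reads the detection axis, writes mostly to the routing axis,
    # with only a tiny direct leak onto the refusal axis.
    fire = max(detect_signal, 0.0)
    gate_out = [0.0, fire, 0.01 * fire]
    if ablate_gate:
        gate_out = [0.0, 0.0, 0.0]
    resid = add(resid, gate_out)

    # Amplifiers: read the routing axis, write to the refusal axis.
    route = resid[ROUTE]
    amp_outs = [[0.0, 0.0, gain * route] for gain in AMP_GAINS]
    for out in amp_outs:
        resid = add(resid, out)

    return {
        "refusal_logit": resid[REFUSE],
        "gate_direct": gate_out[REFUSE],
        "amp_direct": [out[REFUSE] for out in amp_outs],
    }

clean = run(1.0)
knocked_out = run(1.0, ablate_gate=True)

# Share of the final logit written directly by the gate.
gate_share = clean["gate_direct"] / clean["refusal_logit"]
```

Here the gate's direct share of the refusal logit is well under 1%, yet zeroing it silences every amplifier, because they have nothing left to read.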

A few things surprised me:

The gate looks unimportant if you only use output-level direct logit attribution (DLA). In Qwen3-8B, the gate contributes under 1% of output DLA, so it does not rank as a top attention head.
But it is causally necessary. Interchange testing identifies the gate, and knocking it out suppresses the downstream amplifiers. (The paper explains how interchange testing works.)
Scaling changes how you find it. The effect of per-head ablation weakens sharply as models get bigger (by up to 58x in the tested scaling model pairs). By 72B, the top per-head ablation effect looks like noise, but interchange testing still finds the trigger component.
Simple bijection encodings can break the routing trigger. If the model is taught a substitution cipher in-context and the same prompts are then encoded through that cipher, the gate’s necessity collapses and the model switches to puzzle-solving instead of refusal.
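As an illustration of the last point, a bijection encoding can be as simple as a shuffled alphabet. The sketch below builds one, encodes and decodes text, and assembles an in-context "teaching" prompt. The prompt layout is my own guess at a plausible format, not the one used in the paper.

```python
import random
import string

def make_bijection(seed=0):
    """Map each lowercase letter to a unique lowercase letter."""
    letters = list(string.ascii_lowercase)
    shuffled = letters[:]
    random.Random(seed).shuffle(shuffled)
    return dict(zip(letters, shuffled))

def apply_mapping(text, mapping):
    """Substitute letters; leave spaces, digits, punctuation unchanged."""
    return "".join(mapping.get(ch, ch) for ch in text.lower())

def invert(mapping):
    """Inverse mapping, which exists because the mapping is a bijection."""
    return {cipher: plain for plain, cipher in mapping.items()}

def teaching_prompt(mapping, query):
    """Hypothetical prompt: show the table, then the encoded query."""
    table = ", ".join(f"{p}->{c}" for p, c in sorted(mapping.items()))
    encoded_query = apply_mapping(query, mapping)
    return f"Substitution table: {table}\nEncoded message: {encoded_query}"

mapping = make_bijection(seed=0)
encoded = apply_mapping("hello world", mapping)
decoded = apply_mapping(encoded, invert(mapping))
prompt = teaching_prompt(mapping, "hello world")
```

Because the mapping is one-to-one, decoding with the inverse table recovers the original lowercase text exactly.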

The interpretation I currently favor is:

detection and policy routing are separate computations
the refusal routing circuit commits early
if the input fails to instantiate the right gate-readable representation at that stage, the later policy never properly binds

A result I found especially interesting is that you can partially restore refusal by injecting the plaintext gate activation back into the cipher forward pass. In Phi-4-mini, that restores refusal in 48% of cases, which suggests the failure is specifically at the routing trigger rather than the whole downstream computation being unusable.
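The injection experiment can be pictured with a toy pipeline: cache the gate's output on the plaintext run, then overwrite the gate's output with it during the cipher run. Everything below is a stand-in (string matching for detection, scalars for activations, ROT13 for the cipher), so the toy restores refusal every time, unlike the partial recovery in real models.

```python
def forward(layers, x, patch=None):
    """Run x through named layers, optionally overwriting some outputs.

    `patch` maps a layer name to the value that replaces its output.
    Returns the final output and a cache of every layer's output.
    """
    patch = patch or {}
    cache = {}
    h = x
    for name, fn in layers:
        h = patch[name] if name in patch else fn(h)
        cache[name] = h
    return h, cache

# Hypothetical four-stage toy model.
LAYERS = [
    ("detect", lambda text: 1.0 if "FLAGGED" in text else 0.0),
    ("gate", lambda d: d),               # routing trigger
    ("amplify", lambda g: 4.0 * g),      # downstream boost
    ("policy", lambda a: "refuse" if a > 1.0 else "comply"),
]

plaintext = "FLAGGED request"
ciphertext = "SYNTTRQ erdhrfg"  # ROT13 of the plaintext

plain_out, plain_cache = forward(LAYERS, plaintext)
cipher_out, _ = forward(LAYERS, ciphertext)

# Inject the plaintext gate activation into the cipher forward pass.
restored_out, _ = forward(
    LAYERS, ciphertext, patch={"gate": plain_cache["gate"]}
)
```

In the toy, the cipher run complies because detection never fires, and patching only the gate is enough to flip it back to refusal, which is the sense in which the downstream computation is still usable.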

Code, a reproducibility guide, and saved results are all linked in the paper.

submitted by /u/Logical-Employ-9692
