Qwen-Scope: Official Sparse Autoencoders (SAEs) for Qwen 3.5 models

Reddit r/LocalLLaMA / 4/30/2026


Key Points

  • Qwen Team released Qwen-Scope, an open collection of sparse autoencoders (SAEs) that map interpretable internal features in the residual stream of Qwen 3.5 models from 2B to 35B MoE.
  • The release provides a “dictionary” view of model concepts (e.g., refusal, legal-domain language, Python code, style-related features) and tools to identify which feature IDs activate on specific inputs.
  • Qwen-Scope enables applications such as “surgical ablation” (suppressing a targeted feature), feature steering (amplifying or forcing certain concepts during generation), and debugging behaviors like unexpected language switching.
  • It also supports dataset and fine-tuning analysis by checking whether training examples actually trigger the intended internal features.
  • The team's license discourages using the tools to remove safety filters or otherwise interfere with model capabilities, even though that is precisely what the feature controls make technically possible.

Qwen Team released Qwen-Scope — a collection of Sparse Autoencoders (SAEs) for the Qwen 3.5 family (from 2B to 35B MoE). They’ve mapped internal features for the residual stream across all layers.

What is this exactly? Think of it as a dictionary of the model's internal concepts. Instead of looking at raw numbers, you can see specific "features" that represent concepts like "legal talk", "Python code", or "refusal".
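To make the "dictionary" idea concrete, here is a toy numpy sketch of what a sparse autoencoder does to a residual-stream vector. The dimensions, weights, and function names are all made up for illustration; this is not the Qwen-Scope API, just the standard encode → ReLU → decode structure of an SAE.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes for illustration only; real SAE dictionaries are far larger.
d_model, d_features = 64, 512

# Randomly initialized stand-in weights (a trained SAE would load these).
W_enc = rng.normal(0, 0.1, (d_model, d_features))
b_enc = np.zeros(d_features)
W_dec = rng.normal(0, 0.1, (d_features, d_model))
b_dec = np.zeros(d_model)

def sae_encode(resid):
    """Map one token's residual-stream vector to feature activations.

    The ReLU zeroes most features; each surviving index is a "feature ID"
    you can look up in the dictionary (e.g. "Python code", "refusal").
    """
    return np.maximum(resid @ W_enc + b_enc, 0.0)

def sae_decode(feats):
    """Reconstruct the residual-stream vector from feature activations."""
    return feats @ W_dec + b_dec

resid = rng.normal(size=d_model)        # one token's residual-stream activation
feats = sae_encode(resid)               # activations over the feature dictionary
top_ids = np.argsort(feats)[-5:][::-1]  # the most active feature IDs for this token
```

A trained SAE is optimized so that `sae_decode(sae_encode(resid))` reconstructs `resid` while keeping `feats` sparse, which is what makes individual feature IDs interpretable.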

What can you do with this?

  1. Surgical Abliteration: You can find the exact feature ID for refusal/moralizing and suppress it. This is much more precise than the standard "mean difference" method and helps preserve reasoning. Note: The Qwen team explicitly discourages using these tools for removing safety filters or "interfering with model capabilities" in their license, but technically, this is exactly what these SAEs enable.
  2. Feature Steering: You can "force-activate" certain concepts during generation (e.g., making the model more technical or forcing a specific style) by injecting feature directions into the hidden states.
  3. Model Debugging: Identify which tokens trigger specific internal directions (like unexpected language switching or refusals).
  4. Dataset Analysis: Scan your fine-tuning data to see if it actually activates the intended internal features.

How it works in practice (Space demo example):

  • Diagnostic: If the model behaves weirdly — for example, you ask in English, but it suddenly starts mixing in Chinese — you can use the Feature Comparison tab. It will show you exactly which Feature ID spiked. You'll see a heatmap showing that, for example, "Feature #6159" (Chinese language) is over-activated.
  • Control (Steering): Once you know the ID, you can use the Feature Steering tab to "mute" that specific feature or "amplify" others (like a "Classical Literary Style"). Instead of fighting the model with prompts, you're literally turning the knobs in its brain.
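The Feature Comparison diagnostic described above amounts to building a tokens-by-features activation grid and looking for spikes. A minimal sketch with random stand-in data (the shapes and the "suspect feature" logic are illustrative, not the Space's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_features, n_tokens = 64, 512, 8
W_enc = rng.normal(0, 0.1, (d_model, d_features))

# Stand-in residual-stream activations for each token of a prompt.
resid = rng.normal(size=(n_tokens, d_model))
acts = np.maximum(resid @ W_enc, 0.0)  # (tokens, features) grid == the heatmap

# Which feature spiked hardest, and on which token?
per_feature_peak = acts.max(axis=0)
suspect = int(per_feature_peak.argmax())
token_of_spike = int(acts[:, suspect].argmax())
print(f"Feature #{suspect} spiked on token {token_of_spike}")
```

With a real SAE and real activations, `suspect` is the feature ID you would then look up in the dictionary and mute or amplify in the Feature Steering tab.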

Space: https://huggingface.co/spaces/Qwen/Qwen-Scope

Technical Report: https://qianwen-res.oss-accelerate.aliyuncs.com/qwen-scope/Qwen_Scope.pdf

submitted by /u/MadPelmewka