Behavioral Steering in a 35B MoE Language Model via SAE-Decoded Probe Vectors: One Agency Axis, Not Five Traits
arXiv cs.LG / March 18, 2026
Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The authors train nine sparse autoencoders on the residual stream of Qwen 3.5-35B-A3B, a 35B MoE model with a hybrid GatedDeltaNet/attention architecture, to identify and steer five agentic traits (a minimal SAE sketch follows this list).
- They train linear probes on SAE latent activations and reconstruct the probe weights through the SAE decoder, yielding continuous steering vectors in the model's native activation space; this bypasses top-k discretization and enables inference-time behavioral intervention without retraining (see the sketch after this list).
- Across 1,800 agent rollouts spanning 50 scenarios and 36 conditions, autonomy steering at multiplier 2 yields a Cohen's d of 1.01 (p < 0.0001), shifting the model from asking the user for help to proactively executing code and running web searches (the d computation is sketched below).
- Cross-trait analysis reveals that all five steering vectors mainly modulate a single dominant agency axis—the disposition to act independently versus defer to the user—with trait-specific effects appearing only as secondary modulations in tool-type composition and dose-response shape.
- The tool-use vector shows a moderate effect (d = 0.39), the risk-calibration vector primarily suppresses behavior, and steering applied only during autoregressive decoding has no detectable effect (p > 0.35), indicating that behavioral commitments are computed during prefill in GatedDeltaNet architectures (see the prefill-only sketch after this list).
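The SAE setup in the first point can be made concrete with the usual parameterization. A minimal sketch in Python, assuming a vanilla ReLU SAE trained with a reconstruction-plus-L1 objective; the paper's nine SAEs may use a different activation or sparsity penalty, and all names here are illustrative:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Vanilla SAE over residual-stream activations: encode into an
    overcomplete latent space, decode back, penalize L1 on the latents."""
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_enc = nn.Linear(d_model, d_sae)
        self.W_dec = nn.Linear(d_sae, d_model)

    def forward(self, x: torch.Tensor):
        z = torch.relu(self.W_enc(x))   # sparse latent activations
        x_hat = self.W_dec(z)           # reconstruction of the residual stream
        return x_hat, z

def sae_loss(x, x_hat, z, l1_coeff=1e-3):
    # Reconstruction error plus a sparsity penalty on latent activations.
    return ((x - x_hat) ** 2).mean() + l1_coeff * z.abs().mean()
```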
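The decoder-reconstruction step in the second point is the core trick: a probe trained over SAE latents defines a direction, and pushing its weights through the decoder maps that direction back into the model's activation space. A sketch under the parameterization above, where the decoder matrix of shape `(d_sae, d_model)` is `sae.W_dec.weight.T`; this is an illustration of the technique, not the paper's code:

```python
import torch

def probe_to_steering_vector(probe_w: torch.Tensor,
                             W_dec: torch.Tensor) -> torch.Tensor:
    """Map linear-probe weights over SAE latents back into the residual
    stream as a probe-weighted sum of decoder directions. Every latent
    contributes continuously, so no top-k discretization is applied.

    probe_w: (d_sae,)          probe weights over SAE latent activations
    W_dec:   (d_sae, d_model)  SAE decoder matrix
    returns: (d_model,)        unit-norm steering direction
    """
    v = probe_w @ W_dec    # weighted combination of decoder rows
    return v / v.norm()    # normalize; scale with a multiplier at use

# Usage: with the SAE module sketched above,
#   v_steer = probe_to_steering_vector(probe_w, sae.W_dec.weight.T)
# then add a scaled copy to the residual stream at the probed layer,
# e.g. hidden = hidden + 2.0 * v_steer (multiplier 2, as in the key points).
```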
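For the effect sizes quoted above, Cohen's d with a pooled standard deviation is the conventional computation. A small sketch, assuming each rollout is scored with a scalar behavioral metric (the variable names are illustrative):

```python
import numpy as np

def cohens_d(steered: np.ndarray, baseline: np.ndarray) -> float:
    """Standardized mean difference between steered and baseline rollout
    scores, using the pooled standard deviation (Cohen's d)."""
    n1, n2 = len(steered), len(baseline)
    pooled_var = ((n1 - 1) * steered.var(ddof=1) +
                  (n2 - 1) * baseline.var(ddof=1)) / (n1 + n2 - 2)
    return float((steered.mean() - baseline.mean()) / np.sqrt(pooled_var))
```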
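The prefill-only finding in the last point implies a specific intervention pattern: inject the steering vector while the prompt is processed, then decode without it. A hedged sketch against a Hugging Face-style decoder stack; the layer path `model.model.layers[layer_idx]`, the hook mechanics, and the greedy decoding loop are assumptions about a typical setup, not the paper's harness:

```python
import torch

@torch.no_grad()
def generate_prefill_steered(model, tokenizer, prompt, v_steer,
                             layer_idx, alpha=2.0, max_new_tokens=128):
    """Add alpha * v_steer to one layer's residual stream during prefill
    only, then greedily decode with the hook removed."""
    def add_vector(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * v_steer.to(hidden.dtype)
        return (hidden,) + tuple(output[1:]) if isinstance(output, tuple) else hidden

    ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

    # Steered prefill: hook is active while the prompt populates the KV cache.
    handle = model.model.layers[layer_idx].register_forward_hook(add_vector)
    out = model(ids, use_cache=True)
    handle.remove()

    # Unsteered decoding: per the paper, steering here has no detectable effect.
    past, tok = out.past_key_values, out.logits[:, -1:].argmax(-1)
    generated = [tok]
    for _ in range(max_new_tokens - 1):
        out = model(tok, past_key_values=past, use_cache=True)
        past, tok = out.past_key_values, out.logits[:, -1:].argmax(-1)
        generated.append(tok)
        if tok.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(torch.cat(generated, dim=-1)[0],
                            skip_special_tokens=True)
```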