Behavioral Steering in a 35B MoE Language Model via SAE-Decoded Probe Vectors: One Agency Axis, Not Five Traits
arXiv cs.LG · March 18, 2026
Key Points
- The authors train nine sparse autoencoders on the residual stream of the 35B MoE model Qwen 3.5-35B-A3B with a hybrid GatedDeltaNet/attention architecture to identify and steer five agentic traits.
- They train linear probes on SAE latent activations, then map the probe weights back through the SAE decoder to obtain continuous steering vectors in the model's native activation space, bypassing top-k discretization and enabling inference-time behavioral intervention without retraining.
- Across 1,800 agent rollouts spanning 50 scenarios and 36 conditions, autonomy steering at multiplier 2 produces a Cohen's d of 1.01 (p < 0.0001), shifting the model from asking the user for help to proactively executing code and searching the web.
- Cross-trait analysis reveals that all five steering vectors mainly modulate a single dominant agency axis—the disposition to act independently versus defer to the user—with trait-specific effects appearing only as secondary modulations in tool-type composition and dose-response shape.
- The tool-use vector shows a moderate effect (d = 0.39), the risk-calibration vector primarily suppresses behavior, and steering applied only during autoregressive decoding has no measurable effect (p > 0.35), indicating that behavioral commitments are computed during prefill in GatedDeltaNet architectures.
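The core mechanism in the second key point can be sketched in a few lines. This is a minimal toy illustration, not the authors' code: the dimensions, the random stand-in decoder, and the `steer_residual` helper are all hypothetical, but the flow matches the described technique of decoding probe weights through the SAE decoder into a continuous residual-stream steering vector.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_sae = 64, 512  # toy sizes; the paper's model is far larger
# Stand-in SAE decoder matrix mapping latent space -> residual stream
W_dec = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_sae)

# A linear probe trained on SAE latent activations assigns one weight per latent
probe_w = rng.normal(size=d_sae)

# Decoding the probe weights through the SAE decoder yields a continuous
# steering vector in the model's native activation space, with no top-k
# discretization of latents
steer = probe_w @ W_dec
steer /= np.linalg.norm(steer)  # unit-normalize before scaling

def steer_residual(h, multiplier=2.0):
    """Add the scaled steering vector to a residual-stream activation."""
    return h + multiplier * steer

h = rng.normal(size=d_model)        # a residual-stream activation
h_steered = steer_residual(h, 2.0)  # multiplier 2, as in the reported runs
```

In a real intervention the addition would be hooked into the residual stream at the probed layer during the forward pass; here it is applied to a single toy activation vector.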
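The effect sizes quoted above (d = 1.01, d = 0.39) are Cohen's d values, i.e. mean differences scaled by the pooled standard deviation. A standard pooled-variance implementation (the exact variant the authors used is not stated here):

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d between two samples using the pooled standard deviation."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Illustrative only: steered vs. control scores on some behavioral metric
control = [1.0, 3.0, 5.0]
steered = [2.0, 4.0, 6.0]
d = cohens_d(steered, control)  # (4 - 3) / sqrt(4) = 0.5
```

By the usual rule of thumb, d around 0.4 is a moderate effect and d above 0.8 a large one, which is why the autonomy result (d = 1.01) stands out against the tool-use vector (d = 0.39).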