Behavioral Steering in a 35B MoE Language Model via SAE-Decoded Probe Vectors: One Agency Axis, Not Five Traits
arXiv cs.LG · March 18, 2026
Key Points
- The authors train nine sparse autoencoders on the residual stream of the 35B MoE model Qwen 3.5-35B-A3B with a hybrid GatedDeltaNet/attention architecture to identify and steer five agentic traits.
- They train linear probes on SAE latent activations, then map the probe weights back through the SAE decoder to obtain continuous steering vectors in the model's native activation space, bypassing top-k discretization and enabling inference-time behavioral intervention without retraining.
- Across 1,800 agent rollouts spanning 50 scenarios and 36 conditions, autonomy steering at multiplier 2 produces a Cohen's d of 1.01 (p < 0.0001), shifting the model from asking the user for help to proactively executing code and searching the web.
- Cross-trait analysis reveals that all five steering vectors mainly modulate a single dominant agency axis—the disposition to act independently versus defer to the user—with trait-specific effects appearing only as secondary modulations in tool-type composition and dose-response shape.
- The tool-use vector shows a moderate effect (d = 0.39), the risk-calibration vector primarily suppresses behavior, and steering applied only during autoregressive decoding has no measurable effect (p > 0.35), indicating that behavioral commitments are computed during prefill in GatedDeltaNet architectures.
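The core mechanism in the second key point can be sketched in a few lines. This is a minimal toy illustration, not the authors' code: the dimensions, the random stand-in decoder, and the `steer_residual` helper are all hypothetical, but the flow matches the described technique of decoding probe weights through the SAE decoder into a continuous residual-stream steering vector.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_sae = 64, 512  # toy sizes; the paper's model is far larger
# Stand-in SAE decoder matrix mapping latent space -> residual stream
W_dec = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_sae)

# A linear probe trained on SAE latent activations assigns one weight per latent
probe_w = rng.normal(size=d_sae)

# Decoding the probe weights through the SAE decoder yields a continuous
# steering vector in the model's native activation space, with no top-k
# discretization of latents
steer = probe_w @ W_dec
steer /= np.linalg.norm(steer)  # unit-normalize before scaling

def steer_residual(h, multiplier=2.0):
    """Add the scaled steering vector to a residual-stream activation."""
    return h + multiplier * steer

h = rng.normal(size=d_model)        # a residual-stream activation
h_steered = steer_residual(h, 2.0)  # multiplier 2, as in the reported runs
```

In a real intervention the addition would be hooked into the residual stream at the probed layer during the forward pass; here it is applied to a single toy activation vector.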
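The effect sizes quoted above (d = 1.01, d = 0.39) are Cohen's d values, i.e. mean differences scaled by the pooled standard deviation. A standard pooled-variance implementation (the exact variant the authors used is not stated here):

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d between two samples using the pooled standard deviation."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Illustrative only: steered vs. control scores on some behavioral metric
control = [1.0, 3.0, 5.0]
steered = [2.0, 4.0, 6.0]
d = cohens_d(steered, control)  # (4 - 3) / sqrt(4) = 0.5
```

By the usual rule of thumb, d around 0.4 is a moderate effect and d above 0.8 a large one, which is why the autonomy result (d = 1.01) stands out against the tool-use vector (d = 0.39).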