Behavioral Steering in a 35B MoE Language Model via SAE-Decoded Probe Vectors: One Agency Axis, Not Five Traits
arXiv cs.LG / March 18, 2026
Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The authors train nine sparse autoencoders on the residual stream of Qwen 3.5-35B-A3B, a 35B MoE model with a hybrid GatedDeltaNet/attention architecture, to identify and steer five agentic traits (a minimal SAE sketch follows this list).
- They train linear probes on SAE latent activations and reconstruct the probe weights through the SAE decoder, yielding continuous steering vectors in the model's native activation space; this bypasses top-k discretization and enables inference-time behavioral intervention without retraining (see the sketch after this list).
- Across 1,800 agent rollouts spanning 50 scenarios and 36 conditions, autonomy steering at multiplier 2 yields a Cohen's d of 1.01 (p < 0.0001), shifting the model from asking the user for help to proactively executing code and running web searches (the d computation is sketched below).
- Cross-trait analysis reveals that all five steering vectors mainly modulate a single dominant agency axis—the disposition to act independently versus defer to the user—with trait-specific effects appearing only as secondary modulations in tool-type composition and dose-response shape.
- The tool-use vector shows a moderate effect (d = 0.39), the risk-calibration vector primarily suppresses behavior, and steering applied only during autoregressive decoding has no detectable effect (p > 0.35), indicating that behavioral commitments are computed during prefill in GatedDeltaNet architectures (see the prefill-only sketch after this list).
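The SAE setup in the first point can be made concrete with the usual parameterization. A minimal sketch in Python, assuming a vanilla ReLU SAE trained with a reconstruction-plus-L1 objective; the paper's nine SAEs may use a different activation or sparsity penalty, and all names here are illustrative:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Vanilla SAE over residual-stream activations: encode into an
    overcomplete latent space, decode back, penalize L1 on the latents."""
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_enc = nn.Linear(d_model, d_sae)
        self.W_dec = nn.Linear(d_sae, d_model)

    def forward(self, x: torch.Tensor):
        z = torch.relu(self.W_enc(x))   # sparse latent activations
        x_hat = self.W_dec(z)           # reconstruction of the residual stream
        return x_hat, z

def sae_loss(x, x_hat, z, l1_coeff=1e-3):
    # Reconstruction error plus a sparsity penalty on latent activations.
    return ((x - x_hat) ** 2).mean() + l1_coeff * z.abs().mean()
```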
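The decoder-reconstruction step in the second point is the core trick: a probe trained over SAE latents defines a direction, and pushing its weights through the decoder maps that direction back into the model's activation space. A sketch under the parameterization above, where the decoder matrix of shape `(d_sae, d_model)` is `sae.W_dec.weight.T`; this is an illustration of the technique, not the paper's code:

```python
import torch

def probe_to_steering_vector(probe_w: torch.Tensor,
                             W_dec: torch.Tensor) -> torch.Tensor:
    """Map linear-probe weights over SAE latents back into the residual
    stream as a probe-weighted sum of decoder directions. Every latent
    contributes continuously, so no top-k discretization is applied.

    probe_w: (d_sae,)          probe weights over SAE latent activations
    W_dec:   (d_sae, d_model)  SAE decoder matrix
    returns: (d_model,)        unit-norm steering direction
    """
    v = probe_w @ W_dec    # weighted combination of decoder rows
    return v / v.norm()    # normalize; scale with a multiplier at use

# Usage: with the SAE module sketched above,
#   v_steer = probe_to_steering_vector(probe_w, sae.W_dec.weight.T)
# then add a scaled copy to the residual stream at the probed layer,
# e.g. hidden = hidden + 2.0 * v_steer (multiplier 2, as in the key points).
```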
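For the effect sizes quoted above, Cohen's d with a pooled standard deviation is the conventional computation. A small sketch, assuming each rollout is scored with a scalar behavioral metric (the variable names are illustrative):

```python
import numpy as np

def cohens_d(steered: np.ndarray, baseline: np.ndarray) -> float:
    """Standardized mean difference between steered and baseline rollout
    scores, using the pooled standard deviation (Cohen's d)."""
    n1, n2 = len(steered), len(baseline)
    pooled_var = ((n1 - 1) * steered.var(ddof=1) +
                  (n2 - 1) * baseline.var(ddof=1)) / (n1 + n2 - 2)
    return float((steered.mean() - baseline.mean()) / np.sqrt(pooled_var))
```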
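The prefill-only finding in the last point implies a specific intervention pattern: inject the steering vector while the prompt is processed, then decode without it. A hedged sketch against a Hugging Face-style decoder stack; the layer path `model.model.layers[layer_idx]`, the hook mechanics, and the greedy decoding loop are assumptions about a typical setup, not the paper's harness:

```python
import torch

@torch.no_grad()
def generate_prefill_steered(model, tokenizer, prompt, v_steer,
                             layer_idx, alpha=2.0, max_new_tokens=128):
    """Add alpha * v_steer to one layer's residual stream during prefill
    only, then greedily decode with the hook removed."""
    def add_vector(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * v_steer.to(hidden.dtype)
        return (hidden,) + tuple(output[1:]) if isinstance(output, tuple) else hidden

    ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

    # Steered prefill: hook is active while the prompt populates the KV cache.
    handle = model.model.layers[layer_idx].register_forward_hook(add_vector)
    out = model(ids, use_cache=True)
    handle.remove()

    # Unsteered decoding: per the paper, steering here has no detectable effect.
    past, tok = out.past_key_values, out.logits[:, -1:].argmax(-1)
    generated = [tok]
    for _ in range(max_new_tokens - 1):
        out = model(tok, past_key_values=past, use_cache=True)
        past, tok = out.past_key_values, out.logits[:, -1:].argmax(-1)
        generated.append(tok)
        if tok.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(torch.cat(generated, dim=-1)[0],
                            skip_special_tokens=True)
```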