Pairwise matrices for sparse autoencoders: single-feature inspection mislabels causal axes
arXiv cs.LG / 5/6/2026
Key Points
- The paper introduces a “pairwise matrix protocol” for sparse autoencoder (SAE) interpretability that co-varies the steering coefficient with joint intervention conditions, aiming to identify causal axes more reliably than standard single-feature inspection.
- Using Qwen3-1.7B-Instruct and replicating on Gemma-2-2B-it, the authors show that features inferred from top-activating contexts can be mischaracterized under single-feature steering, exemplified by an “AI self-disclaimer” feature that flips into a contemplative-philosopher-style response at higher coefficients.
- They find cases where multiple near-orthogonal, cluster-specific features must be jointly suppressed to produce meaningful harm (e.g., degrading grounded recipe/engine explanations and introspective prompts), while suppressing each feature alone leaves controls largely intact.
- A geometry-matched comparison (single-feature vs. joint vs. random directions) reveals output regimes that depend on the pattern of steered directions, indicating that coherence loss is not driven by perturbation magnitude alone and that joint suppression can uniquely yield placeholder-like outputs.
- The pipeline also identifies a top causally responsible feature in Llama-3.1-8B-Instruct, supporting the protocol’s usefulness beyond the initial model pair and underscoring limitations of current validation practices.
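The pairwise-matrix idea can be illustrated with a toy sketch: sweep a grid of steering coefficients across single-feature, joint, and norm-matched random-direction conditions, recording the perturbation applied at each cell. Everything below is hypothetical (the feature directions, hook point, and coefficient grid are stand-ins, not the paper's actual setup); it only shows why a joint intervention at coefficient c has a larger perturbation norm than either single feature alone, and how a random baseline is scaled to match it.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # toy hidden dimension (stand-in for a residual-stream width)

# Two hypothetical near-orthogonal SAE decoder directions, made exactly orthonormal
f1 = rng.standard_normal(d)
f1 /= np.linalg.norm(f1)
f2 = rng.standard_normal(d)
f2 -= (f2 @ f1) * f1
f2 /= np.linalg.norm(f2)

def steer(h, directions, coeff):
    """Add coeff * v for each chosen direction v (negative coeff = suppression)."""
    out = h.copy()
    for v in directions:
        out = out + coeff * v
    return out

coeffs = np.linspace(-8, 0, 5)  # suppression strengths, ending at the unsteered case
conditions = {
    "f1_only": [f1],
    "f2_only": [f2],
    "joint":   [f1, f2],
}

h = rng.standard_normal(d)  # stand-in for one activation vector

# Pairwise matrix: rows = intervention condition, cols = coefficient.
# Here each cell records the perturbation norm; in the real protocol each
# cell would hold the model's generated output under that intervention.
matrix = np.array([
    [np.linalg.norm(steer(h, dirs, c) - h) for c in coeffs]
    for dirs in conditions.values()
])

# Geometry-matched random baseline: a random direction scaled so its
# perturbation norm equals the joint condition's at every coefficient
r = rng.standard_normal(d)
r /= np.linalg.norm(r)
random_row = np.array([
    np.linalg.norm(steer(h, [r * np.sqrt(2)], c) - h) for c in coeffs
])
```

With two orthonormal directions, the joint perturbation norm is √2·|c| versus |c| for either single feature, which is why a magnitude-matched random baseline is needed before attributing output collapse to the direction pattern rather than sheer perturbation size.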