Pairwise matrices for sparse autoencoders: single-feature inspection mislabels causal axes

arXiv cs.LG / 5/6/2026

Key Points

The paper introduces a “pairwise matrix protocol” for sparse autoencoder (SAE) interpretability that co-varies the steering coefficient with joint conditions, aiming to better identify causal axes than the standard one-corner (single-feature) inspection.
Using Qwen3-1.7B-Instruct and replicating on Gemma-2-2B-it, the authors show that a feature labelled from its top-activating contexts can be mischaracterized under single-feature steering: an “AI self-disclaimer” feature follows an inverted U-shape under a coefficient sweep, and at c=+500 the model replaces the disclaimer with a fluent contemplative-philosopher voice.
They find that three near-orthogonal, cluster-specific features must be suppressed jointly (at c=-500) to cause damage, which then degrades grounded composition on recipe and engine explanations as well as responses to introspective prompts; suppressing any one feature alone at the same magnitude leaves controls intact.
A geometry-matched comparison (single-feature vs joint vs random directions) reveals direction-pattern-dependent output regimes, indicating coherence loss is not driven purely by perturbation magnitude and that joint suppression can uniquely yield placeholder-like outputs.
The pipeline also identifies a top causally responsible feature in Llama-3.1-8B-Instruct, supporting the protocol’s usefulness beyond the initial model pair and underscoring limitations of current validation practices.
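
As a rough illustration of what a pairwise matrix of steering conditions could look like, the sketch below enumerates every non-empty subset of a few feature directions and crosses it with a coefficient sweep. This is a minimal sketch and not the authors' code: it assumes that "joint condition" means which features are steered together, it uses random unit vectors in place of real SAE decoder directions, and the function name `pairwise_matrix`, the width `d_model = 64`, and the baseline coefficient 0 are all hypothetical. Only the coefficients ±500 come from the abstract.

```python
import itertools
import numpy as np

def pairwise_matrix(directions, coefficients):
    """Map (feature subset, coefficient) -> steering vector.

    directions: array of shape (n_features, d_model), one row per feature.
    The steering vector for a subset is the coefficient times the sum of
    the subset's directions (an assumed way to combine features).
    """
    n = len(directions)
    grid = {}
    for r in range(1, n + 1):
        for subset in itertools.combinations(range(n), r):
            joint = directions[list(subset)].sum(axis=0)
            for c in coefficients:
                grid[(subset, c)] = c * joint
    return grid

rng = np.random.default_rng(0)
d_model = 64  # hypothetical residual-stream width
directions = rng.standard_normal((3, d_model))  # stand-ins for SAE decoder rows
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

coefficients = [-500, 0, 500]  # +/-500 appear in the abstract; 0 is an added baseline
grid = pairwise_matrix(directions, coefficients)
print(len(grid))  # 7 subsets x 3 coefficients = 21 cells
```

The single-feature protocol corresponds to inspecting only the cells whose subset has one element; the remaining cells are the joint conditions that it never visits.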

Abstract

The standard sparse-autoencoder (SAE) interpretability protocol labels each feature from its top-activating contexts and validates by single-feature steering. We propose the pairwise matrix protocol, co-varying steering coefficient with joint condition, and report three findings the standard one-corner protocol misses on Qwen3-1.7B-Instruct, replicated on Gemma-2-2B-it. First, a feature labelled "AI self-disclaimer" from its top contexts produces an inverted U-shape under a coefficient sweep: at c=+500 the model substitutes a fluent contemplative-philosopher voice for the disclaimer. Two further features anchor the criterion (one monotonic, one pure breakdown). Second, three near-orthogonal cluster-specific features that individually steer a philosophy-of-mind register, jointly suppressed at c=-500, damage grounded composition on recipes and engine explanations as well as introspective prompts; single-feature suppression at the same magnitude leaves controls intact. Third, a matched-geometry comparison of single-feature, joint, and random-direction perturbations (norm ~1.55, cosine ~0.64) yields three distinct output regimes: single-feature substitutes strategy filler, random direction substitutes diverse content, joint suppression alone produces placeholder text. Coherence loss is direction-pattern-dependent, not magnitude-dependent. All three findings reproduce on Gemma with model-specific damage signatures; the matched-geometry control is CI-separated by ~10x. The pipeline also locates a top causally responsible feature in Llama-3.1-8B-Instruct.
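
One way to build a random-direction control with matched geometry is to fix both its norm and its cosine to a reference vector, then randomize everything else. The sketch below does that with NumPy. It is only an illustration under stated assumptions: the abstract does not say what the norm (~1.55) and cosine (~0.64) are measured against, so the `reference` vector, the function name `matched_random_direction`, and the dimension 64 are hypothetical.

```python
import numpy as np

def matched_random_direction(reference, norm, cosine, rng):
    """Random vector with a given norm and a given cosine to `reference`."""
    u = reference / np.linalg.norm(reference)
    # Draw a random vector and remove its component along u.
    w = rng.standard_normal(u.shape)
    w -= (w @ u) * u
    w /= np.linalg.norm(w)
    # Mix the parallel and orthogonal unit vectors so that cos(v, u) == cosine.
    v = cosine * u + np.sqrt(1.0 - cosine**2) * w
    return norm * v

rng = np.random.default_rng(0)
reference = rng.standard_normal(64)  # hypothetical reference direction
control = matched_random_direction(reference, norm=1.55, cosine=0.64, rng=rng)

achieved_norm = np.linalg.norm(control)
achieved_cos = control @ reference / (achieved_norm * np.linalg.norm(reference))
print(round(achieved_norm, 6), round(achieved_cos, 6))
```

Because the control shares norm and cosine with the perturbation it is compared against, any difference in model output can be attributed to the direction pattern instead of the perturbation size, which is the kind of comparison the abstract's third finding relies on.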
