Pairwise matrices for sparse autoencoders: single-feature inspection mislabels causal axes

arXiv cs.LG / 5/6/2026


Key Points

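  • The paper introduces a “pairwise matrix protocol” for sparse autoencoder (SAE) interpretability that varies the steering coefficient together with the joint condition (single-feature versus joint perturbation), aiming to identify causal axes more reliably than the standard one-corner protocol, which labels a feature from its top-activating contexts and validates it by single-feature steering.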
  • Working on Qwen3-1.7B-Instruct and replicating on Gemma-2-2B-it, the authors show that labels inferred from top-activating contexts can mischaracterize a feature. Their example is a feature labelled “AI self-disclaimer” that follows an inverted U-shape under a coefficient sweep: at c=+500 the model replaces the disclaimer with a fluent contemplative-philosopher voice.
  • Three near-orthogonal, cluster-specific features, each of which individually steers a philosophy-of-mind register, damage grounded composition (recipes, engine explanations) as well as introspective prompts when jointly suppressed at c=-500, while suppressing any one of them at the same magnitude leaves the controls intact.
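  • A matched-geometry comparison of single-feature, joint, and random-direction perturbations (norm ~1.55, cosine ~0.64) yields three distinct output regimes: single-feature perturbation substitutes strategy filler, a random direction substitutes diverse content, and only joint suppression produces placeholder text. Coherence loss therefore depends on the direction pattern rather than on perturbation magnitude.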
  • All three findings reproduce on Gemma with model-specific damage signatures, and the pipeline also locates a top causally responsible feature in Llama-3.1-8B-Instruct, suggesting the protocol is useful beyond the initial model pair.
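
The grid behind the protocol can be pictured as conditions (each single feature, then the joint set) crossed with steering coefficients. The following is a minimal sketch of that layout on toy vectors, not the authors' code: `steer`, `pairwise_matrix`, and `shift_norm` are hypothetical names, the feature directions are random stand-ins for SAE decoder directions, and the score is a placeholder where the paper would use a behavioural measure on model outputs.

```python
import numpy as np

def steer(resid, directions, coeff):
    """Add coeff times the sum of unit-normalised directions to a residual vector."""
    delta = np.zeros_like(resid)
    for d in directions:
        delta += d / np.linalg.norm(d)
    return resid + coeff * delta

def pairwise_matrix(resid, features, coeffs, score):
    """Rows are conditions (each single feature, then 'joint'); columns are coefficients."""
    conditions = {name: [vec] for name, vec in features.items()}
    conditions["joint"] = list(features.values())
    grid = np.empty((len(conditions), len(coeffs)))
    for i, dirs in enumerate(conditions.values()):
        for j, c in enumerate(coeffs):
            grid[i, j] = score(steer(resid, dirs, c))
    return list(conditions), grid

# Toy setup (assumption): random vectors stand in for residual activations
# and SAE decoder directions.
rng = np.random.default_rng(0)
d_model = 64
resid = rng.standard_normal(d_model)
features = {f"f{i}": rng.standard_normal(d_model) for i in range(3)}
coeffs = [-500.0, -100.0, 0.0, 100.0, 500.0]

def shift_norm(h):
    """Placeholder score: how far the steered vector moved from the original."""
    return float(np.linalg.norm(h - resid))

names, grid = pairwise_matrix(resid, features, coeffs, shift_norm)
```

The one-corner protocol corresponds to reading a single cell of this grid (one feature, one coefficient). Note that the joint row sums several unit directions, so its perturbation is larger than any single-feature row at the same coefficient, which is why a comparison at matched geometry is needed before attributing an effect to the direction pattern.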

Abstract

The standard sparse-autoencoder (SAE) interpretability protocol labels each feature from its top-activating contexts and validates by single-feature steering. We propose the pairwise matrix protocol, co-varying steering coefficient with joint condition, and report three findings the standard one-corner protocol misses on Qwen3-1.7B-Instruct, replicated on Gemma-2-2B-it. First, a feature labelled "AI self-disclaimer" from its top contexts produces an inverted U-shape under a coefficient sweep: at c=+500 the model substitutes a fluent contemplative-philosopher voice for the disclaimer. Two further features anchor the criterion (one monotonic, one pure breakdown). Second, three near-orthogonal cluster-specific features that individually steer a philosophy-of-mind register, jointly suppressed at c=-500, damage grounded composition on recipes and engine explanations as well as introspective prompts; single-feature suppression at the same magnitude leaves controls intact. Third, a matched-geometry comparison of single-feature, joint, and random-direction perturbations (norm ~1.55, cosine ~0.64) yields three distinct output regimes: single-feature substitutes strategy filler, random direction substitutes diverse content, joint suppression alone produces placeholder text. Coherence loss is direction-pattern-dependent, not magnitude-dependent. All three findings reproduce on Gemma with model-specific damage signatures; the matched-geometry control is CI-separated by ~10x. The pipeline also locates a top causally responsible feature in Llama-3.1-8B-Instruct.
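
One way to build a random-direction control with matched geometry is to fix both the norm of the perturbation and its cosine to a reference vector, then randomise everything else. The abstract does not say what the reported norm and cosine are measured against, so the sketch below assumes a single reference vector (`anchor`); the function name `random_matched` and the use of 1.55 and 0.64 as targets are illustrative only.

```python
import numpy as np

def random_matched(anchor, norm, cosine, rng):
    """Draw a random vector with a given norm and a given cosine to `anchor`."""
    if not -1.0 <= cosine <= 1.0:
        raise ValueError("cosine must lie in [-1, 1]")
    a_hat = anchor / np.linalg.norm(anchor)
    # Random direction with its component along the anchor projected out.
    z = rng.standard_normal(anchor.shape)
    z -= (z @ a_hat) * a_hat
    z_hat = z / np.linalg.norm(z)
    # Mix the parallel and orthogonal unit vectors so the result has unit norm
    # and the requested cosine, then rescale to the requested norm.
    unit = cosine * a_hat + np.sqrt(1.0 - cosine**2) * z_hat
    return norm * unit

rng = np.random.default_rng(0)
anchor = rng.standard_normal(64)   # assumption: stand-in reference vector
v1 = random_matched(anchor, norm=1.55, cosine=0.64, rng=rng)
v2 = random_matched(anchor, norm=1.55, cosine=0.64, rng=rng)

def cos(u, w):
    return float(u @ w / (np.linalg.norm(u) * np.linalg.norm(w)))
```

Both draws share the same norm and the same cosine to the reference while pointing in different directions otherwise, so any difference in model behaviour between such a control and the joint perturbation cannot be explained by magnitude alone.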