Do Sparse Autoencoders Capture Concept Manifolds?

arXiv cs.LG / 5/1/2026


Key Points

  • The paper questions a common SAE assumption that concepts align with independent linear directions and instead argues that many concepts lie on low-dimensional geometric manifolds.
  • It proposes a theoretical framework for when and how sparse autoencoders can capture manifolds, distinguishing two mechanisms: a global scheme using a compact set of atoms spanning the whole manifold, and a local scheme using features that tile restricted regions of the geometry.
  • The authors provide empirical evidence that SAEs often recover continuous manifold structure poorly, blending global and local solutions in what they term a “dilution” regime.
  • The dilution behavior helps explain why manifold structure is rarely apparent when inspecting individual learned concepts, motivating post-hoc unsupervised methods to discover coherent groups of atoms rather than isolated directions.
  • Overall, the findings suggest that interpretability in future representation learning should focus on geometric objects (manifold-like units) rather than single feature directions.
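
As background for the key points above, here is a minimal sketch of the kind of sparse autoencoder under discussion (an illustrative construction, not the paper's architecture or training setup): a ReLU encoder produces sparse nonnegative codes, and a dictionary of unit-norm atoms reconstructs the input as a sparse combination.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative SAE forward pass (untrained; dimensions are arbitrary choices).
d, m, n = 8, 32, 256                            # input dim, atoms, samples
X = rng.normal(size=(n, d))

W_enc = rng.normal(size=(d, m)) / np.sqrt(d)    # encoder weights
b_enc = np.zeros(m)                             # encoder bias
D = rng.normal(size=(m, d))                     # decoder dictionary, atoms as rows
D /= np.linalg.norm(D, axis=1, keepdims=True)   # unit-norm atoms, a common convention

codes = np.maximum(X @ W_enc + b_enc, 0.0)      # sparse nonnegative activations
X_hat = codes @ D                               # reconstruction as a sum of atoms

# An L1 penalty on the codes is what pushes each input to use only a few atoms.
loss = np.mean((X - X_hat) ** 2) + 1e-3 * np.abs(codes).mean()
```

The question the paper asks is how a manifold-shaped concept ends up distributed over the rows of `D`: a few atoms spanning it globally, many atoms tiling it locally, or a diluted mix of both.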

Abstract

Sparse autoencoders (SAEs) are widely used to extract interpretable features from neural network representations, often under the implicit assumption that concepts correspond to independent linear directions. However, a growing body of evidence suggests that many concepts are instead organized along low-dimensional manifolds encoding continuous geometric relationships. This raises three basic questions: what does it mean for an SAE to capture a manifold, when do existing SAE architectures do so, and how? We develop a theoretical framework that answers these questions and show that SAEs can capture manifolds in two fundamentally different ways: globally, by allocating a compact group of atoms whose linear span contains the entire manifold, or locally, by distributing it across features that each selectively tile a restricted region of the underlying geometry. Empirically, we find that SAEs suboptimally recover continuous structures, mixing the global subspace and local tiling solutions in a fragmented regime we call dilution. This explains why manifold structure is rarely visible at the level of individual concepts and motivates post-hoc unsupervised discovery methods that search for coherent groups of atoms rather than isolated directions. More broadly, our results suggest that future representation learning methods should treat geometric objects, not just individual directions, as the basic units of interpretability.
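
The global-versus-local distinction in the abstract can be made concrete on a toy manifold. The sketch below is an assumed construction (not from the paper): a circle embedded in R^8, captured once globally, by two atoms whose linear span contains the whole circle, and once locally, by sixteen atoms that each activate only on a short arc.

```python
import numpy as np

rng = np.random.default_rng(1)
theta = np.linspace(0, 2 * np.pi, 200, endpoint=False)

# Embed the unit circle into R^8 using a random orthonormal pair of directions.
Q, _ = np.linalg.qr(rng.normal(size=(8, 2)))
X = np.cos(theta)[:, None] * Q[:, 0] + np.sin(theta)[:, None] * Q[:, 1]

# Global scheme: a compact group of two atoms spanning the entire manifold.
A_global = Q.T                                  # (2, 8)
codes_g = X @ A_global.T                        # every point activates both atoms
X_hat_g = codes_g @ A_global                    # exact, since X lies in span(Q)

# Local scheme: many atoms, each tiling a restricted arc of the circle.
k = 16
centers = np.linspace(0, 2 * np.pi, k, endpoint=False)
atoms = np.cos(centers)[:, None] * Q[:, 0] + np.sin(centers)[:, None] * Q[:, 1]
acts = np.maximum(X @ atoms.T - np.cos(2 * np.pi / k), 0)  # fires only near its arc
X_hat_l = acts @ atoms                          # blend of the few active atoms
X_hat_l /= np.linalg.norm(X_hat_l, axis=1, keepdims=True)  # project back onto the circle

print(np.abs(X - X_hat_g).max())                # global: exact reconstruction
print(np.abs(X - X_hat_l).max())                # local: small tiling error
```

Both schemes reconstruct the circle well, yet they look completely different at the level of individual atoms, which is why inspecting single learned features in isolation can hide the manifold structure.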