Provably Extracting the Features from a General Superposition

arXiv stat.ML / 4/1/2026


Key Points

The paper investigates whether and how interpretable feature directions can be extracted when they are “in superposition,” framed as an overcomplete setting where the number of features exceeds the dimension (n > d).
It considers a model with query access to f(x)=∑_{i=1}^n σ_i(v_i^T x), aiming to recover the hidden feature directions v_i and the overall function f.
The authors present an efficient query algorithm that, given only noisy oracle access to f, identifies every feature direction whose response function is non-degenerate and reconstructs f.
A key contribution is generality: the method tolerates essentially arbitrary superpositions, requiring only that no two feature directions v_i, v_j (i ≠ j) are nearly identical, and it supports general response functions σ_i.
The approach works by searching in Fourier space and iteratively refining candidate sets to locate the hidden directions v_i.
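
To make the query model in the points above concrete, here is a minimal sketch of an overcomplete instance (n > d) together with a noisy oracle. It is only an illustration of the setting, not the paper's algorithm. The helper names (`make_superposition`, `noisy_oracle`, `max_pairwise_overlap`), the choice of response functions, the additive Gaussian noise model, and the use of |⟨v_i, v_j⟩| as a measure of "nearly identical" are all assumptions made for this example.

```python
import numpy as np

def make_superposition(n, d, seed=0):
    """Hypothetical instance of f(x) = sum_i sigma_i(v_i^T x)."""
    rng = np.random.default_rng(seed)
    V = rng.standard_normal((n, d))
    V /= np.linalg.norm(V, axis=1, keepdims=True)  # each row v_i is a unit vector
    # Illustrative response functions only; the paper allows general sigma_i.
    base = [np.tanh, np.sin, lambda t: np.maximum(t, 0.0)]
    sigmas = [base[i % len(base)] for i in range(n)]
    return V, sigmas

def f(x, V, sigmas):
    """Noiseless value of the superposition at a query point x."""
    z = V @ x  # projections v_i^T x
    return float(sum(s(zi) for s, zi in zip(sigmas, z)))

def noisy_oracle(x, V, sigmas, noise_std=1e-3, rng=None):
    """Query access with additive Gaussian noise (an assumed noise model)."""
    rng = rng or np.random.default_rng()
    return f(x, V, sigmas) + noise_std * rng.standard_normal()

def max_pairwise_overlap(V):
    """Largest |<v_i, v_j>| over i != j; values near 1 mean nearly identical directions."""
    G = np.abs(V @ V.T)
    np.fill_diagonal(G, 0.0)
    return float(G.max())

# Overcomplete example: n = 8 features in d = 4 dimensions.
V, sigmas = make_superposition(n=8, d=4)
y = noisy_oracle(np.ones(4), V, sigmas)
```

A learner in this setting only sees values such as `y`; the rows of `V` and the functions in `sigmas` are hidden and are what the algorithm must recover.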

Abstract

It is widely believed that complex machine learning models generally encode features through linear representations. This is the foundational hypothesis behind a vast body of work on interpretability. A key challenge toward extracting interpretable features, however, is that they exist in superposition. In this work, we study the question of extracting features in superposition from a learning-theoretic perspective. We start with the following fundamental setting: we are given query access to a function \[ f(x)=\sum_{i=1}^n \sigma_i(v_i^\top x), \] where each unit vector $v_i$ encodes a feature direction and $\sigma_i:\mathbb{R}\to\mathbb{R}$ is an arbitrary response function, and our goal is to recover the $v_i$ and the function $f$. In learning-theoretic terms, superposition refers to the overcomplete regime, when the number of features is larger than the underlying dimension (i.e. $n > d$), which has proven especially challenging for typical algorithmic approaches. Our main result is an efficient query algorithm that, from noisy oracle access to $f$, identifies all feature directions whose responses are non-degenerate and reconstructs the function $f$. Crucially, our algorithm works in a significantly more general setting than all related prior results. We allow for essentially arbitrary superpositions, only requiring that $v_i, v_j$ are not nearly identical for $i \neq j$, and allowing for general response functions $\sigma_i$. At a high level, our algorithm introduces an approach for searching in Fourier space by iteratively refining the search space to locate the hidden directions $v_i$.
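
One standard fact may help explain why Fourier space is a natural place to search. The abstract does not spell out the argument, so the following is a sketch under stated assumptions rather than the paper's derivation. Assume the convention $\hat g(\omega)=\int g(x)\,e^{-i\omega^\top x}\,dx$ and interpret all transforms as distributions. Splitting $x$ into its component along the unit vector $v_i$ and its component orthogonal to $v_i$ gives, for a single term,
\[
\widehat{\sigma_i(v_i^\top \cdot)}(\omega) = (2\pi)^{d-1}\,\hat\sigma_i(v_i^\top \omega)\,\delta_{d-1}\big(\omega - (v_i^\top \omega)\,v_i\big),
\]
where $\delta_{d-1}$ denotes the Dirac delta on the $(d-1)$-dimensional subspace orthogonal to $v_i$. Summing over $i$,
\[
\hat f(\omega) = (2\pi)^{d-1}\sum_{i=1}^n \hat\sigma_i(v_i^\top \omega)\,\delta_{d-1}\big(\omega - (v_i^\top \omega)\,v_i\big),
\]
so $\hat f$ is supported on the union of the lines $\{t\,v_i : t\in\mathbb{R}\}$. Two distinct lines through the origin meet only at $\omega = 0$, which suggests that, even when $n > d$, each hidden direction leaves its own trace away from the origin.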
