The Linear Centroids Hypothesis: How Deep Network Features Represent Data
arXiv cs.LG / April 15, 2026
Key Points
- The paper proposes the Linear Centroids Hypothesis (LCH), which aims to improve on the interpretability of the Linear Representation Hypothesis (LRH) by characterizing features as linear directions over input-space centroids rather than over latent activations alone.
- LCH defines centroids as vector summaries of a deep network's local functional behavior, aiming to avoid limitations of the LRH, such as ignoring neuron- and layer-level structure and being vulnerable to spurious features.
- The authors show that LCH-based interpretability can reuse existing LRH tooling (e.g., sparse autoencoders) by applying sparse feature learning to centroids instead of raw latent activations; see the sketch after this list.
- Experiments indicate that for DINO vision transformers, fitting dictionaries to centroids produces sparser feature dictionaries and improves performance on downstream tasks.
- The framework extends beyond vision models: the authors suggest that LCH can identify circuits in GPT-2 Large, and they release code for studying the hypothesis.
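
To make the pipeline concrete, here is a minimal sketch of the SAE-on-centroids step. The centroid construction below (the input-space gradient of one latent coordinate, averaged over a small perturbation neighborhood) is an assumed stand-in for the paper's definition, and the toy network, dimensions, and hyperparameters are illustrative placeholders, not the authors' settings.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a deep feature extractor (e.g., a DINO ViT backbone).
d_in, d_latent, d_dict = 64, 32, 128
net = nn.Sequential(
    nn.Linear(d_in, d_latent),
    nn.GELU(),
    nn.Linear(d_latent, d_latent),
)

def centroid(x, unit=0, n_local=8, eps=0.05):
    """Assumed centroid: the input-space gradient of one latent coordinate,
    averaged over a small neighborhood of x, as a vector summary of the
    network's local functional behavior at x."""
    grads = []
    for _ in range(n_local):
        xp = (x + eps * torch.randn_like(x)).requires_grad_(True)
        (g,) = torch.autograd.grad(net(xp)[unit], xp)
        grads.append(g)
    return torch.stack(grads).mean(0)

# One centroid per input; this matrix replaces the usual matrix of raw
# latent activations as the SAE's training data.
X = torch.randn(256, d_in)
C = torch.stack([centroid(x) for x in X])  # shape: (256, d_in)

# Standard sparse autoencoder (ReLU encoder, linear decoder, L1 penalty),
# i.e., existing LRH tooling, fit to centroids instead of activations.
enc = nn.Linear(d_in, d_dict)
dec = nn.Linear(d_dict, d_in, bias=False)
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

for step in range(500):
    z = torch.relu(enc(C))                               # sparse codes
    loss = (dec(z) - C).pow(2).mean() + 1e-3 * z.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"reconstruction + L1 loss: {loss.item():.4f}")
print(f"mean active dictionary units per centroid: "
      f"{(z > 0).float().sum(dim=1).mean().item():.1f}")
```

Because the SAE itself is unchanged, swapping `C` for a matrix of raw latent activations recovers the standard LRH workflow; the only difference is what the dictionary is fit to.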