Finding Belief Geometries with Sparse Autoencoders

arXiv cs.AI / 4/6/2026


Key Points

  • The paper tackles mechanistic interpretability by asking whether large language models encode “belief states” as simplex-shaped geometric structures in their internal representations, a structure previously demonstrated in transformers trained on hidden-Markov-model data.
  • It proposes a pipeline that uses sparse autoencoders, k-subspace clustering over SAE features, and simplex fitting (via AANet) to discover candidate simplex-structured subspaces in transformer representations.
  • The authors first validate the approach on a transformer trained on data from a multipartite hidden Markov model whose belief-state geometry is known, then apply it to Gemma-2-9B, finding 13 priority clusters with candidate simplex geometry (K≥3).
  • To separate genuine belief-state encoding from spurious “tiling” artifacts, they use barycentric prediction as a discriminating test: 5 real clusters pass on at least one data split, while no null (control) cluster passes either.
  • One identified cluster (768_596) achieves the highest causal steering score and is the only one where passive prediction and active intervention agree, but the authors frame the results as preliminary and call for a more structured confirmation protocol.
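The barycentric-prediction test hinges on expressing each residual-stream activation as convex weights over the fitted simplex vertices, then asking whether those mixture coordinates predict better than any single SAE feature. The paper fits simplices with AANet; the sketch below only illustrates the coordinate computation itself, using an equality-constrained least-squares solve followed by a crude clip-and-renormalize step for nonnegativity. The function name and this simplification are mine, not the paper's.

```python
import numpy as np

def barycentric_coords(x, vertices):
    """Approximate convex weights w with x ≈ w @ vertices, w >= 0, sum(w) = 1.

    vertices: (K, d) array of simplex vertices; x: (d,) point.
    Sum-to-one is enforced by augmenting the linear system; nonnegativity
    is handled by clipping and renormalizing (a rough stand-in for a
    proper projected or NNLS solver).
    """
    K = vertices.shape[0]
    A = np.vstack([vertices.T, np.ones(K)])   # (d+1, K): stack geometry + sum constraint
    b = np.concatenate([x, [1.0]])
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    w = np.clip(w, 0.0, None)                 # crude nonnegativity fix
    return w / w.sum()
```

With coordinates in hand, the test would compare a probe trained on `w` against probes trained on individual feature activations, separately for near-vertex and simplex-interior samples, mirroring the paper's two splits.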

Abstract

Understanding the geometric structure of internal representations is a central goal of mechanistic interpretability. Prior work has shown that transformers trained on sequences generated by hidden Markov models encode probabilistic belief states as simplex-shaped geometries in their residual stream, with vertices corresponding to latent generative states. Whether large language models trained on naturalistic text develop analogous geometric representations remains an open question. We introduce a pipeline for discovering candidate simplex-structured subspaces in transformer representations, combining sparse autoencoders (SAEs), k-subspace clustering of SAE features, and simplex fitting using AANet. We validate the pipeline on a transformer trained on a multipartite hidden Markov model with known belief-state geometry. Applied to Gemma-2-9B, we identify 13 priority clusters exhibiting candidate simplex geometry (K ≥ 3). A key challenge is distinguishing genuine belief-state encoding from tiling artifacts: latents can span a simplex-shaped subspace without the mixture coordinates carrying predictive signal beyond any individual feature. We therefore adopt barycentric prediction as our primary discriminating test. Among the 13 priority clusters, 3 exhibit a highly significant advantage on near-vertex samples (Wilcoxon p < 10⁻¹⁴) and 4 on simplex-interior samples. Together, 5 distinct real clusters pass at least one split, while no null cluster passes either. One cluster, 768_596, additionally achieves the highest causal steering score in the dataset. This is the only case where passive prediction and active intervention converge. We present these findings as preliminary evidence that genuine belief-like geometry exists in Gemma-2-9B's representation space, and identify the structured evaluation that would be required to confirm this interpretation.
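The k-subspace clustering step groups SAE feature directions by which low-dimensional subspace they jointly span. The abstract does not specify the exact algorithm, so the sketch below shows the standard Lloyd-style K-subspaces alternation (assign each point to the nearest linear subspace, refit each subspace by SVD), which is one plausible instantiation; the `labels_init` parameter and reseeding heuristic are my additions.

```python
import numpy as np

def k_subspaces(X, k, dim, n_iter=50, seed=0, labels_init=None):
    """Cluster rows of X (n, d) into k linear subspaces of dimension `dim`.

    Alternates two steps until labels stop changing:
      1. refit each cluster's subspace as the top-`dim` right singular
         vectors of its member points;
      2. reassign each point to the subspace with smallest residual.
    """
    rng = np.random.default_rng(seed)
    labels = labels_init if labels_init is not None else rng.integers(k, size=len(X))
    bases = []
    for _ in range(n_iter):
        bases = []
        for j in range(k):
            pts = X[labels == j]
            if len(pts) < dim:  # reseed empty/undersized clusters from random points
                pts = X[rng.choice(len(X), size=dim, replace=False)]
            _, _, Vt = np.linalg.svd(pts, full_matrices=False)
            bases.append(Vt[:dim])          # (dim, d) orthonormal basis
        # residual of every point against every fitted subspace
        res = np.stack(
            [np.linalg.norm(X - (X @ B.T) @ B, axis=1) for B in bases], axis=1
        )
        new = res.argmin(axis=1)
        if np.array_equal(new, labels):
            break
        labels = new
    return labels, bases
```

In the pipeline's context, the rows of `X` would be SAE decoder directions, and each recovered cluster becomes a candidate subspace handed to the simplex-fitting stage. Note this variant fits subspaces through the origin; an affine variant would subtract each cluster's mean before the SVD.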