MetaSAEs: Joint Training with a Decomposability Penalty Produces More Atomic Sparse Autoencoder Latents

arXiv cs.LG / 4/7/2026


Key Points

  • The paper introduces MetaSAEs, a joint-training method for sparse autoencoders that adds a “decomposability penalty” to discourage latent/subspace blending so features become more atomic and concept-specific.
  • MetaSAEs train a small “meta SAE” alongside a primary SAE; the primary SAE is penalized when its decoder directions can be sparsely reconstructed from the meta dictionary, creating gradient pressure toward mutually independent decoder directions.
  • Experiments on GPT-2 Large (layer 20) show a 7.5% reduction in mean |φ| versus an otherwise identical solo SAE and a 7.6% improvement in automated interpretability (fuzzing) scores, suggesting improved atomicity with modest reconstruction overhead.
  • Results on Gemma 2 9B are described as directional, with the method also performing best on not-fully-converged SAEs (up to +8.6% ΔFuzz), and qualitative analysis indicates polysemantic features are split into semantically distinct sub-features tied to different representational subspaces.
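The decomposability penalty described above can be sketched in a few lines. This is a minimal illustrative implementation, not the paper's code: the ReLU meta-encoder, the matrix shapes, and the "fraction of norm explained" form of the penalty are all assumptions chosen to make the idea concrete.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_latents, n_meta = 64, 256, 16

# Hypothetical primary SAE decoder: columns are unit-norm latent directions.
W_dec = rng.normal(size=(d_model, n_latents))
W_dec /= np.linalg.norm(W_dec, axis=0, keepdims=True)

# Hypothetical meta SAE: a small ReLU encoder/decoder over decoder columns.
M_enc = 0.1 * rng.normal(size=(n_meta, d_model))
M_dec = 0.1 * rng.normal(size=(d_model, n_meta))

def decomposability_penalty(W, M_enc, M_dec):
    """Penalize decoder directions the meta dictionary reconstructs well.

    Returns the mean fraction of each column's squared norm explained by
    the meta SAE's sparse reconstruction (1 = fully decomposable).
    """
    z = np.maximum(M_enc @ W, 0.0)   # sparse meta codes, one per column
    W_hat = M_dec @ z                # meta reconstructions of the columns
    resid = np.linalg.norm(W - W_hat, axis=0) ** 2
    explained = 1.0 - resid / (np.linalg.norm(W, axis=0) ** 2)
    return explained.mean()

penalty = decomposability_penalty(W_dec, M_enc, M_dec)
# In joint training this term would be added (with a weight) to the
# primary SAE's reconstruction + sparsity loss, pushing decoder columns
# to resist sparse meta-compression.
```

In an actual training loop both the primary and meta parameters would be optimized with autograd; the NumPy version here only shows what quantity the gradient pressure acts on.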

Abstract

Sparse autoencoders (SAEs) are increasingly used for safety-relevant applications including alignment detection and model steering. These use cases require SAE latents to be as atomic as possible: each latent should represent a single coherent concept drawn from a single underlying representational subspace. In practice, SAE latents blend representational subspaces together; a single feature can activate across semantically distinct contexts that share no true common representation, muddying an already complex picture of model computation. We introduce a joint training objective that directly penalizes this subspace blending. A small meta SAE is trained alongside the primary SAE to sparsely reconstruct the primary SAE's decoder columns; the primary SAE is penalized whenever its decoder directions are easy to reconstruct from the meta dictionary, which occurs whenever latent directions lie in a subspace spanned by other primary directions. This creates gradient pressure toward mutually independent decoder directions that resist sparse meta-compression. On GPT-2 Large (layer 20), the selected configuration reduces mean |φ| by 7.5% relative to an identical solo SAE trained on the same data. Automated interpretability (fuzzing) scores improve by 7.6%, providing external validation of the atomicity gain independent of the training and co-occurrence metrics. Reconstruction overhead is modest. Results on Gemma 2 9B are directional: on not-fully-converged SAEs, the same parameterization yields the best result, a +8.6% ΔFuzz. Though directional, this is an encouraging sign that the method transfers to a larger model. Qualitative analysis confirms that features firing on polysemantic tokens are split into semantically distinct sub-features, each specializing in a distinct representational subspace.
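The summary does not give the paper's exact loss, but the joint objective it describes would take roughly the following form, where all symbols and weights ($\lambda_{\text{sparse}}$, $\lambda_{\text{meta}}$, $P$) are illustrative placeholders rather than the authors' notation:

```latex
\mathcal{L}_{\text{total}}
  = \underbrace{\lVert x - \hat{x} \rVert_2^2}_{\text{primary reconstruction}}
  + \lambda_{\text{sparse}} \lVert z \rVert_1
  + \lambda_{\text{meta}} \, P(W_{\text{dec}}),
\qquad
P(W_{\text{dec}})
  = \frac{1}{n} \sum_{i=1}^{n}
    \left( 1 - \frac{\lVert w_i - \hat{w}_i \rVert^2}{\lVert w_i \rVert^2} \right),
```

where $\hat{w}_i$ is the meta SAE's sparse reconstruction of decoder column $w_i$. Minimizing $P$ rewards decoder directions that are hard to express in the meta dictionary, which is the "gradient pressure toward mutually independent decoder directions" the abstract describes.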