Improving Robustness In Sparse Autoencoders via Masked Regularization

arXiv cs.LG / 4/9/2026


Key Points

  • The paper argues that sparse autoencoders (SAEs) can be brittle and prone to feature absorption, meaning interpretability degrades even when reconstruction quality is high.
  • It highlights evidence that SAEs can also fail more broadly under out-of-distribution (OOD) conditions, suggesting existing training objectives are under-specified for robustness.
  • The authors propose masking-based regularization that randomly replaces tokens during training to break harmful co-occurrence patterns.
  • Experiments show improved robustness across different SAE architectures and sparsity settings, with reduced absorption and stronger probing performance.
  • The approach narrows the SAE OOD performance gap, supporting a practical route toward more reliable mechanistic interpretability tooling.

Abstract

Sparse autoencoders (SAEs) are widely used in mechanistic interpretability to project LLM activations onto sparse latent spaces. However, sparsity alone is an imperfect proxy for interpretability, and current training objectives often result in brittle latent representations. SAEs are known to be prone to feature absorption, where general features are subsumed by more specific ones due to co-occurrence, degrading interpretability despite high reconstruction fidelity. Recent negative results on Out-of-Distribution (OOD) performance further underscore broader robustness-related failures tied to under-specified training objectives. We address this by proposing a masking-based regularization that randomly replaces tokens during training to disrupt co-occurrence patterns. This improves robustness across SAE architectures and sparsity levels, reducing absorption, enhancing probing performance, and narrowing the OOD gap. Our results point toward a practical path for more reliable interpretability tools.
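The masking step described in the abstract can be sketched as a simple preprocessing pass: before activations are collected for SAE training, some fraction of input tokens is swapped for random vocabulary tokens, breaking the co-occurrence statistics that drive feature absorption. This is a minimal illustrative sketch, not the paper's implementation; the function name, the replacement distribution (uniform over the vocabulary), and the default masking probability are all assumptions.

```python
import random


def mask_tokens(token_ids, vocab_size, mask_prob=0.1, rng=None):
    """Randomly replace a fraction of tokens with random vocabulary tokens.

    Hypothetical sketch of masking-based regularization: each token is
    independently replaced with probability `mask_prob` by a token drawn
    uniformly from the vocabulary, disrupting co-occurrence patterns in
    the sequences used to generate activations for SAE training.
    """
    rng = rng or random.Random()
    return [
        rng.randrange(vocab_size) if rng.random() < mask_prob else t
        for t in token_ids
    ]


# Example: mask a toy sequence reproducibly with a seeded RNG.
tokens = [12, 7, 42, 3, 99, 15]
masked = mask_tokens(tokens, vocab_size=1000, mask_prob=0.2,
                     rng=random.Random(0))
```

In practice the masked sequences would be fed through the LLM and the resulting activations used as SAE training inputs; the probability of replacement and the choice of replacement distribution are hyperparameters the paper would tune.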