Sparsity as a Key: Unlocking New Insights from Latent Structures for Out-of-Distribution Detection

arXiv cs.CV / 4/30/2026

📰 News · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper applies Sparse Autoencoders (SAEs) to Vision Transformers (ViTs) by focusing on the [CLS] token to improve out-of-distribution (OOD) detection.
  • It introduces a Top-k SAE framework that disentangles dense [CLS] features into a structured latent space, addressing the limitation of prior approaches that rely on entangled representations (see the sketch after this list).
  • The authors discover class-specific “Class Activation Profiles” (CAPs) where in-distribution (ID) samples maintain stable activation patterns while OOD samples systematically disrupt them.
  • They propose an OOD scoring function using the divergence of “core energy profiles,” achieving strong FPR95 performance and competitive AUROC across multiple benchmarks.
  • Overall, the work argues that sparse, disentangled SAE features provide an interpretable and robust mechanism for OOD detection in vision models.
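
To make the Top-k mechanism concrete, here is a minimal PyTorch sketch of a Top-k sparse autoencoder applied to dense [CLS] features. The class name `TopKSAE`, the dimensions, and the reconstruction objective are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class TopKSAE(nn.Module):
    """Minimal Top-k sparse autoencoder over a dense feature vector.

    Hypothetical sketch: encode a [CLS] feature into an overcomplete latent
    space, keep only the k largest activations, and reconstruct the input
    from that sparse code.
    """

    def __init__(self, d_model: int, d_latent: int, k: int):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        z = torch.relu(self.encoder(x))           # (batch, d_latent)
        topk = torch.topk(z, self.k, dim=-1)      # keep the k largest activations
        sparse = torch.zeros_like(z)
        return sparse.scatter(-1, topk.indices, topk.values)

    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        z = self.encode(x)
        return self.decoder(z), z


# Example: sparsify a batch of hypothetical 768-d ViT [CLS] features.
sae = TopKSAE(d_model=768, d_latent=8192, k=64)
cls_feats = torch.randn(4, 768)
recon, codes = sae(cls_feats)
recon_loss = nn.functional.mse_loss(recon, cls_feats)
```

The hard Top-k step enforces sparsity directly, so each [CLS] vector is represented by a small, fixed number of active latents; the per-class pattern of those active latents is what the paper formalizes as a Class Activation Profile.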

Abstract

Sparse Autoencoders (SAEs) have demonstrated significant success in interpreting Large Language Models (LLMs) by decomposing dense representations into sparse, semantic components. However, their potential for analyzing Vision Transformers (ViTs) remains largely under-explored. In this work, we present the first application of SAEs to the ViT [CLS] token for out-of-distribution (OOD) detection, addressing the limitation of existing methods that rely on entangled feature representations. We propose a novel framework utilizing a Top-k SAE to disentangle the dense [CLS] features into a structured latent space. Through this analysis, we reveal that in-distribution (ID) data exhibits consistent, class-specific activation patterns, which we formalize as Class Activation Profiles (CAPs). Our study uncovers a key structural invariant: while ID samples preserve a stable pattern within CAPs, OOD samples systematically disrupt this structure. Leveraging this insight, we introduce a scoring function based on the divergence of core energy profiles to quantify the deviation from ideal activation profiles. Our method achieves strong results on the FPR95 metric, which is critical for safety-sensitive applications, across multiple benchmarks, while also attaining competitive AUROC. Overall, our findings demonstrate that the sparse, disentangled features revealed by SAEs can serve as a powerful, interpretable tool for robust OOD detection in vision models.
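
The abstract does not spell out the exact scoring formula, so the following is only a hypothetical sketch of how a CAP-style divergence score might be computed: average the sparse codes of ID samples per class to form prototype profiles, then score a test sample by its divergence from the closest profile. The helper names and the KL-based divergence below are assumptions for illustration, not the paper's definition of "core energy profiles."

```python
import torch


def class_activation_profiles(codes: torch.Tensor, labels: torch.Tensor,
                              num_classes: int) -> torch.Tensor:
    """Average sparse SAE codes per ID class into prototype profiles
    (one hypothetical reading of the paper's Class Activation Profiles)."""
    caps = torch.zeros(num_classes, codes.shape[-1])
    for c in range(num_classes):
        caps[c] = codes[labels == c].mean(dim=0)
    # Normalize each profile into an "energy" distribution over latents.
    return caps / caps.sum(dim=-1, keepdim=True).clamp_min(1e-8)


def ood_score(code: torch.Tensor, caps: torch.Tensor) -> torch.Tensor:
    """Divergence between a sample's normalized activation profile and its
    best-matching class profile; larger values suggest the sample is OOD."""
    p = code / code.sum().clamp_min(1e-8)
    # KL(p || cap) against every class profile, then take the best fit.
    kl = (p * (p.clamp_min(1e-8).log() - caps.clamp_min(1e-8).log())).sum(dim=-1)
    return kl.min()
```

Under this reading, ID samples land close to the profile of their own class and receive low scores, while OOD samples activate latents outside any stable CAP and receive high scores, which is the structural disruption the paper exploits.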