Towards Understanding the Robustness of Sparse Autoencoders

arXiv cs.AI / 4/22/2026


Key Points

  • The study investigates whether Sparse Autoencoders (SAEs) strengthen LLM defenses against optimization-based jailbreak attacks that exploit internal gradient structure.
  • By injecting pretrained SAEs into transformer residual streams at inference time (without changing model weights or blocking gradients), the authors find up to a 5× reduction in jailbreak success rates across multiple model families.
  • SAE augmentation also lowers cross-model attack transferability, making jailbreak methods less reusable against different LLMs.
  • Parametric ablations show a monotonic dose-response effect between SAE sparsity (L0) and attack success, alongside a layer-dependent tradeoff between robustness and clean performance.
  • The results support a representational bottleneck explanation: sparse projections alter the optimization geometry that jailbreak attacks exploit.
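The core intervention is simple to picture: at a chosen layer, the residual-stream activation is replaced by its SAE reconstruction, with a sparsity cap acting as the L0 knob the ablations vary. The sketch below is a minimal, hypothetical illustration of that idea, not the paper's implementation: the weights are random, the sizes (`d_model`, `d_sae`, `k`) are invented, and a top-k rule stands in for whatever sparsity mechanism the pretrained SAEs actually use.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae, k = 8, 32, 4   # hypothetical sizes; k caps L0 sparsity

# Random stand-ins for a pretrained SAE's encoder/decoder weights.
W_enc = rng.standard_normal((d_model, d_sae)) * 0.1
b_enc = np.zeros(d_sae)
W_dec = rng.standard_normal((d_sae, d_model)) * 0.1
b_dec = np.zeros(d_model)

def sae_reconstruct(x):
    """Encode a residual activation, keep only the top-k latents
    (controlling L0 sparsity), and decode back into model space."""
    z = np.maximum(x @ W_enc + b_enc, 0.0)     # ReLU encoder
    thresh = np.sort(z)[-k]                    # k-th largest activation
    z = np.where(z >= thresh, z, 0.0)          # zero everything below it
    return z @ W_dec + b_dec, z

x = rng.standard_normal(d_model)               # toy residual-stream vector
x_hat, z = sae_reconstruct(x)                  # x_hat replaces x downstream
```

Because `x_hat` is a function of at most `k` active latents, gradients flowing through this layer are confined to a sparse subspace, which is the geometric intuition behind the representational bottleneck hypothesis; lowering `k` tightens the bottleneck, matching the reported dose-response between L0 and attack success.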

Abstract

Large Language Models (LLMs) remain vulnerable to optimization-based jailbreak attacks that exploit internal gradient structure. While Sparse Autoencoders (SAEs) are widely used for interpretability, their robustness implications remain underexplored. We present a study of integrating pretrained SAEs into transformer residual streams at inference time, without modifying model weights or blocking gradients. Across four model families (Gemma, LLaMA, Mistral, Qwen) and two strong white-box attacks (GCG, BEAST) plus three black-box benchmarks, SAE-augmented models achieve up to a 5x reduction in jailbreak success rate relative to the undefended baseline and reduce cross-model attack transferability. Parametric ablations reveal (i) a monotonic dose-response relationship between L0 sparsity and attack success rate, and (ii) a layer-dependent defense-utility tradeoff, where intermediate layers balance robustness and clean performance. These findings are consistent with a representational bottleneck hypothesis: sparse projection reshapes the optimization geometry exploited by jailbreak attacks.