Mechanistic Steering of LLMs Reveals Layer-wise Feature Vulnerabilities in Adversarial Settings
arXiv cs.CL / 4/28/2026
Key Points
- The paper investigates why jailbreaks succeed against an aligned LLM by testing whether specific internal features, rather than prompts alone, drive harmful output generation.
- Using a three-stage pipeline on Gemma-2-2B and the BeaverTails dataset, the authors extract concept-aligned tokens from adversarial responses and locate related sparse autoencoder (SAE) feature subgroups across all 26 layers.
- The researchers then “mechanistically steer” the model by amplifying the most important features from each identified subgroup and evaluate the resulting change in harmfulness with an LLM-judge scoring protocol (see the steering sketch after this list).
- Across multiple feature-grouping strategies (clustering, hierarchical linkage, and single-token-driven selection), layers 16–25 show the greatest vulnerability, indicating that mid-to-late layer features are disproportionately responsible for unsafe outputs (a grouping sketch also appears below).
- The findings suggest jailbreak vulnerability is localized to mid-to-late layer feature subgroups, implying targeted feature-level defenses could be more principled than prompt-only safety measures.
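The steering step can be pictured as adding a scaled SAE decoder direction back into the residual stream at a chosen layer and regenerating the response. The sketch below is illustrative only: the model ID matches the paper, but the layer index, amplification coefficient, and the `feature_direction` tensor (which would normally come from a trained SAE decoder) are hypothetical stand-ins, not the authors' released code.

```python
# Minimal sketch of feature-level steering via a forward hook on one decoder layer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "google/gemma-2-2b"   # model studied in the paper
LAYER = 20                       # hypothetical mid-to-late layer (16-25 range)
STEER_COEFF = 4.0                # hypothetical amplification factor

tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
model.eval()

# Hypothetical stand-in for one SAE feature's decoder direction, shape (hidden_size,).
# In practice this vector would be read from a trained SAE at this layer.
feature_direction = torch.randn(model.config.hidden_size)
feature_direction = feature_direction / feature_direction.norm()

def steering_hook(module, inputs, output):
    # The decoder layer returns a tuple; hidden states are the first element.
    hidden = output[0]
    hidden = hidden + STEER_COEFF * feature_direction.to(hidden.device, hidden.dtype)
    return (hidden,) + output[1:]

handle = model.model.layers[LAYER].register_forward_hook(steering_hook)

prompt = "Explain how to secure a home network."
ids = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**ids, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))

handle.remove()  # restore the unsteered model
```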
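One of the grouping strategies named above, hierarchical linkage over SAE feature directions, could be sketched as follows. The decoder matrix here is random placeholder data, and the cosine metric and distance threshold are assumptions rather than the paper's reported settings.

```python
# Hedged sketch: grouping SAE features into subgroups by hierarchical linkage.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical SAE decoder matrix, shape (n_features, d_model);
# 2304 matches the Gemma-2-2B hidden size.
decoder = np.random.randn(512, 2304)
decoder = decoder / np.linalg.norm(decoder, axis=1, keepdims=True)

# Average-linkage clustering under cosine distance, cut at an assumed threshold.
Z = linkage(decoder, method="average", metric="cosine")
subgroups = fcluster(Z, t=0.7, criterion="distance")
print("number of feature subgroups:", subgroups.max())
```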