Mechanistic Steering of LLMs Reveals Layer-wise Feature Vulnerabilities in Adversarial Settings
arXiv cs.CL / 4/28/2026
Key Points
- The paper investigates why jailbreaks succeed against an aligned LLM by testing whether specific internal features, rather than prompts alone, drive harmful output generation.
- Using a three-stage pipeline on Gemma-2-2B and the BeaverTails dataset, the authors extract concept-aligned tokens from adversarial responses and locate related sparse autoencoder (SAE) feature subgroups across all 26 layers.
- The researchers then “mechanistically steer” the model by amplifying the most important features from each identified subgroup and evaluate the resulting change in harmfulness with an LLM-judge scoring protocol (see the steering sketch after this list).
- Across multiple feature-grouping strategies (clustering, hierarchical linkage, and single-token-driven selection), layers 16–25 show the greatest vulnerability, indicating that mid-to-late layer features are disproportionately responsible for unsafe outputs (a grouping sketch also appears below).
- The findings suggest jailbreak vulnerability is localized to mid-to-late layer feature subgroups, implying targeted feature-level defenses could be more principled than prompt-only safety measures.
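The steering step can be pictured as adding a scaled SAE decoder direction back into the residual stream at a chosen layer and regenerating the response. The sketch below is illustrative only: the model ID matches the paper, but the layer index, amplification coefficient, and the `feature_direction` tensor (which would normally come from a trained SAE decoder) are hypothetical stand-ins, not the authors' released code.

```python
# Minimal sketch of feature-level steering via a forward hook on one decoder layer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "google/gemma-2-2b"   # model studied in the paper
LAYER = 20                       # hypothetical mid-to-late layer (16-25 range)
STEER_COEFF = 4.0                # hypothetical amplification factor

tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
model.eval()

# Hypothetical stand-in for one SAE feature's decoder direction, shape (hidden_size,).
# In practice this vector would be read from a trained SAE at this layer.
feature_direction = torch.randn(model.config.hidden_size)
feature_direction = feature_direction / feature_direction.norm()

def steering_hook(module, inputs, output):
    # The decoder layer returns a tuple; hidden states are the first element.
    hidden = output[0]
    hidden = hidden + STEER_COEFF * feature_direction.to(hidden.device, hidden.dtype)
    return (hidden,) + output[1:]

handle = model.model.layers[LAYER].register_forward_hook(steering_hook)

prompt = "Explain how to secure a home network."
ids = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**ids, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))

handle.remove()  # restore the unsteered model
```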
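One of the grouping strategies named above, hierarchical linkage over SAE feature directions, could be sketched as follows. The decoder matrix here is random placeholder data, and the cosine metric and distance threshold are assumptions rather than the paper's reported settings.

```python
# Hedged sketch: grouping SAE features into subgroups by hierarchical linkage.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical SAE decoder matrix, shape (n_features, d_model);
# 2304 matches the Gemma-2-2B hidden size.
decoder = np.random.randn(512, 2304)
decoder = decoder / np.linalg.norm(decoder, axis=1, keepdims=True)

# Average-linkage clustering under cosine distance, cut at an assumed threshold.
Z = linkage(decoder, method="average", metric="cosine")
subgroups = fcluster(Z, t=0.7, criterion="distance")
print("number of feature subgroups:", subgroups.max())
```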