Beyond I'm Sorry, I Can't: Dissecting Large Language Model Refusal

arXiv cs.CL · April 29, 2026


Key Points

  • The study investigates why instruction-tuned LLMs refuse harmful prompts, using sparse autoencoders (SAEs) to analyze internal activations for two public models: Gemma-2-2B-IT and LLaMA-3.1-8B-IT.
  • It demonstrates causal control of the refusal behavior by searching the SAE latent space for feature sets whose ablation can flip the model’s output from refusal to harmful compliance, effectively enabling jailbreaks.
  • The authors propose a three-stage search pipeline: locating a refusal-mediating “direction,” greedily filtering down to a minimal feature set, and then discovering nonlinear interactions via a factorization machine.
  • The results reveal jailbreak-critical features and also suggest the presence of redundant features that only activate when earlier features are suppressed, pointing to more complex refusal mechanisms.
  • Overall, the work suggests that safety behavior can be audited and intervened on more precisely by manipulating interpretable latent representations than by relying on surface-level prompt handling alone.
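As a rough sketch of Stage 1, a refusal-mediating direction can be estimated as a difference of mean residual-stream activations between harmful and harmless prompts, and SAE features can then be ranked by how closely their decoder vectors align with that direction. The shapes, the toy data, and the difference-in-means construction below are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_feats = 64, 256  # toy sizes; real residual streams and SAEs are far larger

# Toy residual-stream activations: harmful prompts carry a synthetic "refusal" signal.
harmless = rng.normal(size=(100, d_model))
harmful = rng.normal(size=(100, d_model))
harmful[:, 0] += 3.0

# Stage 1 (sketch): refusal direction as a normalized difference of means.
refusal_dir = harmful.mean(axis=0) - harmless.mean(axis=0)
refusal_dir /= np.linalg.norm(refusal_dir)

# Hypothetical SAE decoder: one unit-norm d_model-dim vector per latent feature.
W_dec = rng.normal(size=(n_feats, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)

# Collect candidate features whose decoder vectors lie near the refusal direction.
cos_sim = W_dec @ refusal_dir
candidates = np.argsort(-cos_sim)[:20]
```

The `candidates` set would then feed into the greedy filtering stage; in practice each candidate's causal role is checked by ablating it and re-running the model.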

Abstract

Refusal on harmful prompts is a key safety behaviour in instruction-tuned large language models (LLMs), yet the internal causes of this behaviour remain poorly understood. We study two public instruction-tuned models, Gemma-2-2B-IT and LLaMA-3.1-8B-IT, using sparse autoencoders (SAEs) trained on residual-stream activations. Given a harmful prompt, we search the SAE latent space for feature sets whose ablation flips the model from refusal to compliance, demonstrating causal influence and creating a jailbreak. Our search proceeds in three stages: (1) Refusal Direction: find a refusal-mediating direction and collect SAE features near that direction; (2) Greedy Filtering: prune to a minimal set; and (3) Interaction Discovery: fit a factorization machine (FM) that captures nonlinear interactions among the remaining active features and the minimal set. This pipeline yields a broad set of jailbreak-critical features, offering insight into the mechanistic basis of refusal. Moreover, we find evidence of redundant features that remain dormant unless earlier features are suppressed. Our findings highlight the potential for fine-grained auditing and targeted intervention in safety behaviours by manipulating the interpretable latent space.
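Stages 2 and 3 of the pipeline can be sketched as follows: a greedy pass drops each candidate feature whose removal still leaves a working jailbreak, and a second-order factorization machine scores interactions among surviving features. The `still_jailbreaks` oracle and the FM parameters here are stand-ins for re-running the model with features ablated; this is an illustrative sketch under those assumptions, not the authors' implementation:

```python
import numpy as np

def greedy_filter(candidates, still_jailbreaks):
    """Stage 2 (sketch): prune to a minimal feature set whose joint ablation
    still flips refusal to compliance. `still_jailbreaks` is a hypothetical
    oracle that would re-run the model with the given set ablated."""
    kept = list(candidates)
    for f in list(candidates):
        trial = [g for g in kept if g != f]
        if still_jailbreaks(trial):  # f was redundant: drop it
            kept = trial
    return kept

def fm_predict(x, w0, w, V):
    """Stage 3 (sketch): second-order factorization machine over an ablation
    mask x. The pairwise term sum_{i<j} <v_i, v_j> x_i x_j is computed with
    the standard O(k*d) identity rather than an explicit double loop."""
    pairwise = 0.5 * np.sum((x @ V) ** 2 - (x ** 2) @ (V ** 2))
    return w0 + w @ x + pairwise

# Toy oracle: the jailbreak works iff features 3 and 7 are both ablated.
critical = {3, 7}
minimal = greedy_filter(range(10), lambda feats: critical.issubset(feats))
```

On the toy oracle, the greedy pass recovers exactly the two critical features; in the paper's setting, features that the greedy pass keeps only after others are dropped would correspond to the redundant, dormant features the authors report.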