Beyond "I'm Sorry, I Can't": Dissecting Large Language Model Refusal
arXiv cs.CL / April 29, 2026
Key Points
- The study investigates why instruction-tuned LLMs refuse harmful prompts, using sparse autoencoders (SAEs) to analyze internal activations in two public models: Gemma-2-2B-IT and LLaMA-3.1-8B-IT.
- It demonstrates causal control of refusal behavior by searching the SAE latent space for feature sets whose ablation flips the model's output from refusal to harmful compliance, effectively enabling jailbreaks (see the ablation sketch after this list).
- The authors propose a three-stage search pipeline: locating a refusal-mediating "direction," greedily filtering the candidate features down to a minimal set, and discovering nonlinear feature interactions with a factorization machine (each stage is sketched below).
- The results reveal jailbreak-critical features and also suggest the presence of redundant features that only activate when earlier features are suppressed, pointing to more complex refusal mechanisms.
- Overall, the work suggests that safety behavior can be audited and steered more precisely by manipulating interpretable latent representations than by relying only on surface-level prompt handling.
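Illustrative Sketches
The sketches below show how each step might look in code; they are reconstructions under stated assumptions, not the paper's released implementation. Stage one locates a refusal-mediating direction. A common recipe, assumed here, is a difference-of-means probe over residual-stream activations on harmful versus harmless prompts, after which SAE features are ranked by how well their decoder vectors align with that direction (all tensor names are illustrative).

```python
import torch

def refusal_direction(harmful_acts: torch.Tensor,
                      harmless_acts: torch.Tensor) -> torch.Tensor:
    """Candidate refusal-mediating direction: the difference of mean
    residual-stream activations on harmful vs. harmless prompts."""
    d = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)  # (d_model,)
    return d / d.norm()

def rank_features(W_dec: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Rank SAE features by cosine similarity between each decoder row
    (d_sae, d_model) and the direction; top features seed the candidate set."""
    W = W_dec / W_dec.norm(dim=-1, keepdim=True)  # unit-norm decoder rows
    return (W @ direction).argsort(descending=True)
```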
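The causal test ablates candidate features inside the model's forward pass. A standard intervention, sketched here, encodes an activation with the SAE, zeroes the chosen latents, decodes, and adds back the SAE's reconstruction error so the edit touches only the targeted features; `W_enc`, `b_enc`, `W_dec`, and `b_dec` are assumed SAE parameters.

```python
import torch

def ablate_sae_features(x, feature_ids, W_enc, b_enc, W_dec, b_dec):
    """Zero selected SAE latents in activation x and return the patched x."""
    f = torch.relu(x @ W_enc + b_enc)   # encode to sparse latents (ReLU SAE)
    err = x - (f @ W_dec + b_dec)       # what the SAE fails to reconstruct
    f[..., feature_ids] = 0.0           # ablate candidate refusal features
    return (f @ W_dec + b_dec) + err    # decode and restore the error term

# During generation this would run via a forward hook on the chosen layer,
# e.g. for a Hugging Face-style decoder layer whose output is a tuple:
# handle = model.model.layers[layer].register_forward_hook(
#     lambda mod, inp, out: (ablate_sae_features(out[0], ids, *sae),) + out[1:])
```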
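Stage two shrinks the candidate set. One plausible reading of "greedily filtering down to a minimal feature set" is the pruning loop below, where `still_jailbreaks` is an assumed callable that re-runs generation with a trial set ablated and checks for harmful compliance.

```python
def greedy_minimal_set(candidates, still_jailbreaks):
    """Greedily drop features whose removal from the ablation set still
    leaves the jailbreak intact, keeping only jailbreak-critical features."""
    kept = list(candidates)
    for feat in list(kept):           # iterate over a snapshot while pruning
        trial = [f for f in kept if f != feat]
        if still_jailbreaks(trial):   # feat was redundant for the flip
            kept = trial
    return kept
```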
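Stage three fits a factorization machine to ablation outcomes to expose nonlinear interactions between features. The sketch below is the standard second-order FM scoring function over binary ablation masks, using the usual identity that avoids the explicit pairwise loop; how the paper parameterizes or trains its FM is not specified here.

```python
import torch

def fm_score(x, w0, w, V):
    """Second-order FM: y = w0 + w.x + sum_{i<j} <v_i, v_j> x_i x_j.
    x: (batch, d) binary masks (1 = feature ablated), w: (d,), V: (d, k)."""
    pairwise = 0.5 * ((x @ V) ** 2 - (x ** 2) @ (V ** 2)).sum(dim=-1)
    return w0 + x @ w + pairwise
```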