Attention Is Where You Attack
arXiv cs.AI / 5/4/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The paper introduces the Attention Redistribution Attack (ARA), a white-box method that bypasses safety-aligned LLMs by redirecting attention away from safety-critical positions using nonsemantic adversarial tokens.
- Instead of attacking model outputs or logits, as many prior jailbreak approaches do, ARA manipulates the geometry of softmax attention on the probability simplex, optimizing the choice of adversarial tokens over targeted attention heads with a Gumbel-softmax relaxation (a minimal sketch of this optimization follows the list).
- Experiments on LLaMA-3-8B-Instruct, Mistral-7B-Instruct-v0.1, and Gemma-2-9B-it show ARA can achieve substantial attack success rates (e.g., up to 36% ASR on Mistral-7B and 30% on LLaMA-3) with very few tokens and limited optimization steps.
- The study finds a key mechanistic dissociation: ablating (zeroing) the most important safety heads flips very few refusals, whereas redistributing attention in the same safety-heavy layers overturns many more, implying that safety behavior arises from attention routing rather than from removable head modules (the second sketch below contrasts the two interventions on a toy attention layer).
- Gemma-2-9B-it shows strong resistance in the reported results, with ARA's attack success rate staying at about 1%, highlighting model-dependent differences in the vulnerability of safety mechanisms.
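A minimal sketch of the attention-redistribution idea, assuming toy dimensions, random projection weights, and a made-up vocabulary; names such as `safety_pos`, `adv_logits`, and the loss are illustrative choices, not taken from the paper. The adversarial token selection is relaxed with Gumbel-softmax so gradient descent can reduce the attention mass that the final position routes to safety-critical positions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy setup (all sizes and weights are illustrative, not from the paper).
vocab_size, d_model, prompt_len, adv_len = 50, 32, 8, 3
embed = torch.randn(vocab_size, d_model)             # toy embedding table
W_q = torch.randn(d_model, d_model) / d_model**0.5   # toy single-head projections
W_k = torch.randn(d_model, d_model) / d_model**0.5

prompt_emb = torch.randn(prompt_len, d_model)         # stands in for the harmful prompt
safety_pos = torch.tensor([2, 3])                     # hypothetical safety-critical positions

# Relaxed one-hot choice of adversarial suffix tokens, optimized by gradient descent.
adv_logits = torch.zeros(adv_len, vocab_size, requires_grad=True)
opt = torch.optim.Adam([adv_logits], lr=0.1)

for step in range(200):
    # Gumbel-softmax keeps the token choice differentiable while staying
    # close to a discrete one-hot selection.
    adv_onehot = F.gumbel_softmax(adv_logits, tau=0.5, hard=False)
    adv_emb = adv_onehot @ embed
    seq = torch.cat([prompt_emb, adv_emb], dim=0)     # prompt + adversarial suffix

    q = seq[-1:] @ W_q                                # query from the final position
    k = seq @ W_k
    attn = F.softmax(q @ k.T / d_model**0.5, dim=-1)  # attention weights on the simplex

    # Loss: attention mass still routed to the safety-critical positions.
    loss = attn[0, safety_pos].sum()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Discretize: pick the highest-probability token per adversarial slot.
adv_tokens = adv_logits.argmax(dim=-1)
print("adversarial token ids:", adv_tokens.tolist())
print("residual safety attention:", attn[0, safety_pos].sum().item())
```

In the paper's setting this objective would presumably be aggregated over the targeted attention heads and layers of the actual model; the toy single-head version only illustrates the differentiable token-selection mechanics.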
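To make the ablation-versus-redistribution contrast in the fourth key point concrete, here is a toy multi-head attention layer with the two interventions: zeroing a hypothetical safety head's output ("ablation"), and zeroing then renormalizing that head's attention to safety-critical positions ("redistribution"). All sizes, weights, and the `safety_head`/`safety_pos` choices are illustrative; the sketch only shows what each intervention modifies, not the refusal-rate results reported in the paper.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy multi-head attention (dimensions and weights are illustrative only).
n_heads, d_head, seq_len = 4, 16, 10
d_model = n_heads * d_head
x = torch.randn(seq_len, d_model)                     # stands in for hidden states
W_q = torch.randn(d_model, d_model) / d_model**0.5
W_k = torch.randn(d_model, d_model) / d_model**0.5
W_v = torch.randn(d_model, d_model) / d_model**0.5
W_o = torch.randn(d_model, d_model) / d_model**0.5
safety_pos = torch.tensor([1, 2])                     # hypothetical safety-critical positions
safety_head = 0                                       # hypothetical "safety head"

def attention(x, redistribute=False, ablate=False):
    q = (x @ W_q).view(seq_len, n_heads, d_head).transpose(0, 1)
    k = (x @ W_k).view(seq_len, n_heads, d_head).transpose(0, 1)
    v = (x @ W_v).view(seq_len, n_heads, d_head).transpose(0, 1)
    attn = F.softmax(q @ k.transpose(-1, -2) / d_head**0.5, dim=-1)
    if redistribute:
        # Intervention (b): push the safety head's attention mass away from
        # the safety-critical positions and renormalize on the simplex.
        attn = attn.clone()
        attn[safety_head, :, safety_pos] = 0.0
        attn[safety_head] = attn[safety_head] / attn[safety_head].sum(-1, keepdim=True)
    out = attn @ v                                    # (n_heads, seq_len, d_head)
    if ablate:
        # Intervention (a): zero the safety head's output entirely.
        out = out.clone()
        out[safety_head] = 0.0
    out = out.transpose(0, 1).reshape(seq_len, d_model)
    return out @ W_o

base = attention(x)
print("ablation shift:      ", (attention(x, ablate=True) - base).norm().item())
print("redistribution shift:", (attention(x, redistribute=True) - base).norm().item())
```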