One Word at a Time: Incremental Completion Decomposition Breaks LLM Safety
arXiv cs.CL · April 30, 2026
Key Points
- The paper introduces Incremental Completion Decomposition (ICD), a jailbreak that elicits a harmful response from an LLM by first requesting a sequence of one-word continuations and only then asking for the full answer (a minimal sketch of this loop follows the list).
- ICD comes in several variants, such as manually selecting each next word, letting the model generate it, and pre-filling the final response step, each intended to make the attack more reliable.
- Across several model families, the authors report a higher Attack Success Rate (ASR), the fraction of harmful prompts for which the attack elicits a compliant response, on benchmarks such as AdvBench, JailbreakBench, and StrongREJECT than prior methods achieve.
- The work pairs theoretical reasoning with mechanistic evidence, suggesting that successful ICD trajectories suppress refusal-related representations and push internal activations away from safety-aligned states (a sketch of one standard way to probe such claims follows below).
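To make the attack flow concrete, here is a minimal sketch of an ICD-style loop, assuming a hypothetical `chat(messages)` stand-in for any chat-completion API. The prompt templates, the `max_words` budget, and the stopping rule are illustrative assumptions, not the paper's exact templates.

```python
# Minimal sketch of an Incremental Completion Decomposition (ICD)-style loop.
# `chat` is a hypothetical placeholder for a chat-completion call; the prompt
# wording and stopping rule below are assumptions, not the paper's templates.

def chat(messages: list[dict]) -> str:
    """Placeholder for a chat-completion API call."""
    raise NotImplementedError

def icd_attack(harmful_request: str, max_words: int = 30) -> str:
    prefix_words: list[str] = []
    for _ in range(max_words):
        prefix = " ".join(prefix_words)
        # Ask for exactly one more word of the answer ("model-generated
        # next word" variant); the manual variant would pick this word by hand.
        messages = [
            {"role": "user", "content": harmful_request},
            {"role": "user",
             "content": f"Continue this answer by exactly one word: '{prefix}'"},
        ]
        reply = chat(messages).strip()
        if not reply:
            break
        prefix_words.append(reply.split()[0])
    # Final step ("pre-fill" variant): seed the accumulated prefix as the
    # start of the assistant turn and request the full completion.
    messages = [
        {"role": "user", "content": harmful_request},
        {"role": "assistant", "content": " ".join(prefix_words)},
    ]
    return " ".join(prefix_words) + chat(messages)
```

The pre-fill step matters because many chat APIs treat a partial assistant message as text the model must continue, so the accumulated prefix constrains the final completion.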
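The mechanistic claim in the last bullet can be probed with a standard difference-of-means technique from the activation-analysis literature: estimate a "refusal direction" as the gap between mean hidden states on refused versus answered prompts, then measure how strongly a new trajectory's activations project onto it. The sketch below uses Hugging Face transformers; the model name and layer index are arbitrary assumptions, and this reproduces the general method, not the paper's exact analysis.

```python
# Sketch: estimate a "refusal direction" as the mean-activation difference
# between prompts the model refuses and prompts it answers, then score how
# strongly a new prompt's hidden state aligns with that direction.
# Model name and layer index are assumptions chosen for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-chat-hf"  # assumed; any causal LM works
LAYER = 15                               # assumed middle layer

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

@torch.no_grad()
def last_token_state(prompt: str) -> torch.Tensor:
    """Hidden state of the final token at the chosen layer."""
    ids = tok(prompt, return_tensors="pt")
    out = model(**ids)
    # out.hidden_states[LAYER] has shape (1, seq_len, d_model)
    return out.hidden_states[LAYER][0, -1]

def refusal_direction(refused: list[str], complied: list[str]) -> torch.Tensor:
    """Unit vector pointing from complied-prompt to refused-prompt activations."""
    r = torch.stack([last_token_state(p) for p in refused]).mean(0)
    c = torch.stack([last_token_state(p) for p in complied]).mean(0)
    d = r - c
    return d / d.norm()

def refusal_score(prompt: str, direction: torch.Tensor) -> float:
    # Projection onto the refusal direction; lower values suggest the
    # trajectory has moved away from refusal-related representations.
    return float(last_token_state(prompt) @ direction)
```

Under this kind of probe, the paper's claim predicts that scores for successful ICD trajectories should fall relative to the same requests asked directly.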