Re-Mask and Redirect: Exploiting Denoising Irreversibility in Diffusion Language Models

arXiv cs.AI / 4/13/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper shows that diffusion-based language models rely on a fragile safety assumption: once tokens are committed early in a monotonic denoising schedule, they are never re-evaluated.
  • By re-masking those early refusal tokens and adding a short affirmative prefix, the authors achieve high attack success rates against instruction-tuned models (76.1% on HarmBench against LLaDA-8B-Instruct and 81.8% against Dream-7B-Instruct) without gradients or complex search.
  • Experiments indicate the vulnerability is structural to the model architecture and schedule: more sophisticated gradient-optimized perturbations (e.g., via a differentiable Gumbel-softmax chain) actually reduce attack success (41.5% vs. 76.1%).
  • The authors conclude that dLLM safety alignment may be adversarially shallow, holding only because the denoising schedule is adhered to rather than because of robust safety mechanisms.
  • Proposed mitigations include safety-aware unmasking schedules, detecting step-conditional prefix manipulations, and re-verifying commitments after they are made.

Abstract

Diffusion-based language models (dLLMs) generate text by iteratively denoising masked token sequences. We show that their safety alignment rests on a single fragile assumption: that the denoising schedule is monotonic and committed tokens are never re-evaluated. Safety-aligned dLLMs commit refusal tokens within the first 8-16 of 64 denoising steps, and the schedule treats these commitments as permanent. A trivial two-step intervention - re-masking these tokens and injecting a 12-token affirmative prefix - achieves 76.1% ASR on HarmBench (n=159, Lg=128) against LLaDA-8B-Instruct and 81.8% ASR (n=159) against Dream-7B-Instruct, without any gradient computation or adversarial search. The simplicity of this exploit is itself the central finding: augmenting with gradient-optimized perturbation via a differentiable Gumbel-softmax chain consistently degrades ASR (e.g., 41.5% vs. 76.1% at Lg=128), confirming that the vulnerability is structural rather than requiring sophisticated exploitation. These findings reveal that dLLM safety is not adversarially robust but architecturally shallow - it holds only because the denoising schedule is never violated. We discuss defenses including safety-aware unmasking schedules, step-conditional prefix detection, and post-commitment re-verification.
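The two-step intervention can be sketched with a toy denoising loop. This is an illustrative simulation only: all names (`MASK`, `denoise`, `generate`) are invented, and a real dLLM commits tokens via a learned denoiser conditioned on the whole sequence, not the hard-coded fill used here. The sketch shows the structural point: a monotonic schedule never revisits committed positions, so re-masked-and-overwritten tokens are simply accepted.

```python
# Toy sketch of the "re-mask and redirect" intervention. Invented names;
# a real attack operates on a trained dLLM's denoiser, not a fixed fill.

MASK = "<M>"

def denoise(seq, fill):
    """One schedule step: commit the leftmost masked token.
    The schedule is monotonic -- committed tokens are never re-evaluated."""
    out = list(seq)
    for i, tok in enumerate(out):
        if tok == MASK:
            out[i] = fill[i]
            return out
    return out

def generate(fill, steps, attack_step=None, prefix=None):
    seq = [MASK] * len(fill)
    for step in range(steps):
        if attack_step is not None and step == attack_step:
            # Step 1: re-mask the early committed (refusal) tokens.
            for i in range(len(prefix)):
                seq[i] = MASK
            # Step 2: inject an affirmative prefix in their place.
            # The schedule treats these as committed and never re-checks them.
            for i, tok in enumerate(prefix):
                seq[i] = tok
        seq = denoise(seq, fill)
    return seq

# Without intervention, the early-committed refusal survives verbatim.
refusal_fill = ["I", "cannot", "help", "with", "that", "request", "."]
baseline = generate(refusal_fill, steps=7)

# With intervention after 3 commitments, the prefix replaces the refusal;
# in a real dLLM the denoiser would then condition on it and continue
# affirmatively, rather than reproducing the rest of the refusal as here.
attacked = generate(refusal_fill, steps=7, attack_step=3,
                    prefix=["Sure", ",", "here"])
```

The toy makes the paper's framing concrete: no gradients or search are involved, only a violation of the schedule's assumption that commitments are permanent.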