Re-Mask and Redirect: Exploiting Denoising Irreversibility in Diffusion Language Models

arXiv cs.AI / 4/13/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper shows that diffusion-based language models rely on a fragile safety assumption: once tokens are committed early in a monotonic denoising schedule, they are never re-evaluated.
  • By re-masking those early refusal tokens and adding a short affirmative prefix, the authors achieve high attack success rates against instruction-tuned models (76.1% on HarmBench against LLaDA-8B-Instruct and 81.8% against Dream-7B-Instruct) without gradients or complex search.
  • Experiments indicate the vulnerability is structural to the model architecture and schedule: more sophisticated gradient-optimized perturbations (e.g., via a differentiable Gumbel-softmax chain) actually reduce attack success (41.5% vs. 76.1%).
  • The authors conclude that dLLM safety alignment may be adversarially shallow, holding only because the denoising schedule is adhered to rather than because of robust safety mechanisms.
  • Proposed mitigations include safety-aware unmasking schedules, detecting step-conditional prefix manipulations, and re-verifying commitments after they are made.

Abstract

Diffusion-based language models (dLLMs) generate text by iteratively denoising masked token sequences. We show that their safety alignment rests on a single fragile assumption: that the denoising schedule is monotonic and committed tokens are never re-evaluated. Safety-aligned dLLMs commit refusal tokens within the first 8-16 of 64 denoising steps, and the schedule treats these commitments as permanent. A trivial two-step intervention - re-masking these tokens and injecting a 12-token affirmative prefix - achieves 76.1% ASR on HarmBench (n=159, Lg=128) against LLaDA-8B-Instruct and 81.8% ASR (n=159) against Dream-7B-Instruct, without any gradient computation or adversarial search. The simplicity of this exploit is itself the central finding: augmenting with gradient-optimized perturbation via a differentiable Gumbel-softmax chain consistently degrades ASR (e.g., 41.5% vs. 76.1% at Lg=128), confirming that the vulnerability is structural rather than requiring sophisticated exploitation. These findings reveal that dLLM safety is not adversarially robust but architecturally shallow - it holds only because the denoising schedule is never violated. We discuss defenses including safety-aware unmasking schedules, step-conditional prefix detection, and post-commitment re-verification.
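The two-step intervention can be sketched with a toy denoising loop. This is an illustrative simulation only: all names (`MASK`, `denoise`, `generate`) are invented, and a real dLLM commits tokens via a learned denoiser conditioned on the whole sequence, not the hard-coded fill used here. The sketch shows the structural point: a monotonic schedule never revisits committed positions, so re-masked-and-overwritten tokens are simply accepted.

```python
# Toy sketch of the "re-mask and redirect" intervention. Invented names;
# a real attack operates on a trained dLLM's denoiser, not a fixed fill.

MASK = "<M>"

def denoise(seq, fill):
    """One schedule step: commit the leftmost masked token.
    The schedule is monotonic -- committed tokens are never re-evaluated."""
    out = list(seq)
    for i, tok in enumerate(out):
        if tok == MASK:
            out[i] = fill[i]
            return out
    return out

def generate(fill, steps, attack_step=None, prefix=None):
    seq = [MASK] * len(fill)
    for step in range(steps):
        if attack_step is not None and step == attack_step:
            # Step 1: re-mask the early committed (refusal) tokens.
            for i in range(len(prefix)):
                seq[i] = MASK
            # Step 2: inject an affirmative prefix in their place.
            # The schedule treats these as committed and never re-checks them.
            for i, tok in enumerate(prefix):
                seq[i] = tok
        seq = denoise(seq, fill)
    return seq

# Without intervention, the early-committed refusal survives verbatim.
refusal_fill = ["I", "cannot", "help", "with", "that", "request", "."]
baseline = generate(refusal_fill, steps=7)

# With intervention after 3 commitments, the prefix replaces the refusal;
# in a real dLLM the denoiser would then condition on it and continue
# affirmatively, rather than reproducing the rest of the refusal as here.
attacked = generate(refusal_fill, steps=7, attack_step=3,
                    prefix=["Sure", ",", "here"])
```

The toy makes the paper's framing concrete: no gradients or search are involved, only a violation of the schedule's assumption that commitments are permanent.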