DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models

arXiv cs.CL / 3/27/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper analyzes jailbreak vulnerabilities in Diffusion Large Language Models (dLLMs), attributing them to a generation mechanism (iterative, parallel decoding) that is fundamentally different from that of autoregressive LLMs.
  • Experiments reveal a harmful bias hidden in the standard greedy remasking strategy and identify a phenomenon the authors term "Denoising-path Dependence," in which the safety of early-stage tokens decisively shapes the final output.
  • The analysis further shows that while current decoding strategies are the main vulnerability, dLLMs possess substantial intrinsic safety potential, and to unlock it the paper proposes DiffuGuard, a training-free defense.
  • DiffuGuard combines Stochastic Annealing Remasking, which suppresses greedy-selection bias, with Block-level Audit and Repair, which uses internal model representations for risk detection and guided correction; across four dLLMs and six jailbreak methods it reportedly reduces the Attack Success Rate from 47.9% to 14.7% while preserving utility and efficiency.
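
The first component above trades greedy remasking for temperature-annealed sampling. As a rough intuition (a minimal sketch, not the paper's actual implementation; the linear temperature schedule, the `k` parameter, and all function names here are assumptions), instead of always unmasking the positions with the highest model confidence, one can sample positions from a softmax over confidences whose temperature decays toward zero across denoising steps, so early steps are exploratory and late steps are near-greedy:

```python
import math
import random

def select_tokens_to_unmask(confidences, step, total_steps, k=1, seed=0):
    """Stochastic annealing remasking (illustrative sketch).

    Rather than greedily unmasking the k highest-confidence positions,
    sample positions with a temperature that anneals from high (random,
    early steps) toward zero (greedy, late steps). Early randomness is
    meant to break the bias of always committing first to the model's
    most confident -- and possibly unsafe -- tokens.
    """
    rng = random.Random(seed)
    # Linearly decaying temperature: exploratory early, near-greedy late.
    temperature = max(1e-6, 1.0 - step / total_steps)
    # Softmax weights over per-position confidences (max-subtracted for stability).
    logits = [c / temperature for c in confidences]
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    # Sample k distinct positions without replacement.
    positions = list(range(len(confidences)))
    chosen = []
    for _ in range(min(k, len(positions))):
        total = sum(weights[i] for i in positions)
        r = rng.random() * total
        acc = 0.0
        for i in positions:
            acc += weights[i]
            if acc >= r:
                chosen.append(i)
                positions.remove(i)
                break
    return chosen
```

At the final step the temperature is effectively zero and the selection collapses to the usual greedy argmax, so the randomness is only injected where the paper's analysis locates the vulnerability: the early denoising steps.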

Abstract

The rapid advancement of Diffusion Large Language Models (dLLMs) introduces unprecedented vulnerabilities that are fundamentally distinct from Autoregressive LLMs, stemming from their iterative and parallel generation mechanisms. In this paper, we conduct an in-depth analysis of dLLM vulnerabilities to jailbreak attacks across two distinct dimensions: intra-step and inter-step dynamics. Experimental results reveal a harmful bias inherent in the standard greedy remasking strategy and identify a critical phenomenon we term Denoising-path Dependence, where the safety of early-stage tokens decisively influences the final output. These findings also indicate that while current decoding strategies constitute a significant vulnerability, dLLMs possess a substantial intrinsic safety potential. To unlock this potential, we propose DiffuGuard, a training-free defense framework that addresses vulnerabilities through a dual-stage approach: Stochastic Annealing Remasking dynamically introduces controlled randomness to mitigate greedy selection bias, while Block-level Audit and Repair exploits internal model representations for autonomous risk detection and guided correction. Comprehensive experiments on four dLLMs demonstrate DiffuGuard's exceptional effectiveness, reducing Attack Success Rate against six diverse jailbreak methods from 47.9% to 14.7% while preserving model utility and efficiency. Our code is available at: https://github.com/niez233/DiffuGuard.
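
The abstract describes Block-level Audit and Repair only at a high level: internal representations flag risky blocks, which are then corrected. One common way to realize such detection (purely an assumption here, not the paper's mechanism; the pooling choice, the `harm_direction` probe, and the threshold are all hypothetical) is a linear probe: project a pooled hidden-state vector onto a direction assumed to separate harmful from benign activations, and re-mask the block for re-denoising when the score exceeds a threshold:

```python
def audit_block(hidden_states, harm_direction, threshold=0.5):
    """Flag a decoded block as risky via a (hypothetical) linear probe.

    hidden_states: list of per-token hidden vectors for the block.
    harm_direction: vector assumed to separate harmful from benign
                    activations (e.g. fit on labeled prompts beforehand).
    Returns True when the mean-pooled projection exceeds `threshold`,
    signaling that the block should be re-masked and re-denoised.
    """
    dim = len(harm_direction)
    pooled = [sum(h[d] for h in hidden_states) / len(hidden_states)
              for d in range(dim)]
    score = sum(p * w for p, w in zip(pooled, harm_direction))
    return score > threshold

def repair_block(tokens, mask_token="[MASK]"):
    """Repair here just re-masks the whole block so the diffusion sampler
    can regenerate it (in a real system, under safety-guided decoding)."""
    return [mask_token] * len(tokens)
```

Because both steps operate on quantities the model already produces (hidden states and masked positions), a defense of this shape is training-free in the same sense the paper claims for DiffuGuard: no weights are updated, only the decoding loop changes.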