Dependency-Guided Parallel Decoding in Discrete Diffusion Language Models

arXiv cs.CL / 4/6/2026


Key Points

  • The paper addresses a key limitation of discrete diffusion language models (dLLMs): parallel unmasking can cause a distributional mismatch, because it samples from a factorized product of per-token marginals instead of the true joint conditional.
  • It introduces DEMASK (DEpendency-guided unMASKing), which adds a lightweight dependency predictor on top of a dLLM to estimate pairwise conditional influences among masked positions in a single forward pass.
  • DEMASK uses these dependency estimates with a greedy bounded-dependency selection strategy to decide which tokens to unmask simultaneously, aiming to reduce the gap from the model’s true joint distribution.
  • The authors provide a theoretical guarantee (under a sub-additivity assumption) that the proposed selection bounds the total variation distance between the parallel sampling distribution and the model’s joint.
  • Experiments on Dream-7B show a 1.7–2.2× speedup while matching or improving accuracy versus confidence-based and KL-based parallel decoding baselines.
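The greedy bounded-dependency selection described above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's implementation: the function name, the symmetric `dependency` matrix, the `confidence` scores, and the `budget` threshold are all assumed notation.

```python
import numpy as np

def greedy_bounded_selection(dependency, confidence, budget):
    """Illustrative sketch of greedy bounded-dependency selection.

    dependency: (M, M) array of estimated pairwise conditional
                influences among masked positions (assumed symmetric,
                nonnegative) -- what DEMASK's predictor would output.
    confidence: (M,) array of per-position unmasking confidence.
    budget:     cap on the summed dependency a new position may add
                to the already-selected set.
    Returns the indices chosen for simultaneous unmasking.
    """
    order = np.argsort(-confidence)  # consider most confident first
    selected = []
    for i in order:
        # cumulative dependency between candidate i and the current set
        added = sum(dependency[i, j] for j in selected)
        if added <= budget:
            selected.append(int(i))
    return selected
```

For example, if two high-confidence positions influence each other strongly, the second is deferred to a later step, while a weakly coupled position can still join the parallel batch.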

Abstract

Discrete diffusion language models (dLLMs) accelerate text generation by unmasking multiple tokens in parallel. However, parallel decoding introduces a distributional mismatch: it approximates the joint conditional using a fully factorized product of per-token marginals, which degrades output quality when selected tokens are strongly dependent. We propose DEMASK (DEpendency-guided unMASKing), a lightweight dependency predictor that attaches to the final hidden states of a dLLM. In a single forward pass, it estimates pairwise conditional influences between masked positions. Using these predictions, a greedy selection algorithm identifies positions with bounded cumulative dependency for simultaneous unmasking. Under a sub-additivity assumption, we prove this bounds the total variation distance between our parallel sampling and the model's joint. Empirically, DEMASK achieves 1.7–2.2× speedup on Dream-7B while matching or improving accuracy compared to confidence-based and KL-based baselines.
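The abstract does not state the theoretical guarantee explicitly, but under a sub-additivity assumption a bound of the following shape is plausible (the notation here is mine, not the paper's: $S$ is the set of positions unmasked in parallel, $p_S$ the model's joint over those positions, $\hat p_S$ the factorized product of marginals, $d_{ij}$ the estimated pairwise conditional influence, and $\epsilon$ the selection budget):

```latex
% Hypothetical form of the TV bound; d_{ij} and \epsilon are assumed notation.
\mathrm{TV}\big(\hat p_S,\, p_S\big)
  \;\le\; \sum_{\substack{i, j \in S \\ i < j}} d_{ij}
  \;\le\; \epsilon
```

Sub-additivity is what lets the total mismatch be controlled by the sum of pairwise terms, so the greedy selection only needs to keep that cumulative sum below the budget $\epsilon$.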