Attention-Based Sampler for Diffusion Language Models

arXiv cs.CL / 4/13/2026


Key Points

  • The paper addresses limitations of autoregressive decoding by studying how diffusion-based LLMs can choose decoding order beyond token-level signals.
  • It proves that decoding tokens in descending order of their attention-matrix column sums approximately maximizes sequence log-likelihood.
  • Based on this theory, the authors introduce Attn-Sampler, a training-free attention-guided decoding algorithm intended to improve generation quality over greedy approaches.
  • To make the method practical and faster, they propose a block attention approximation and dynamic attention thresholding to accelerate decoding while preserving benefits.
  • Experiments on multiple benchmarks show improved generation quality and increased decoding parallelism compared with existing decoding strategies.
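The ordering criterion at the heart of the method can be sketched with a toy example. This is a minimal illustration of the idea, not the paper's implementation: it assumes a single attention matrix where rows are queries and columns are keys, so a column sum measures how much total attention a token receives.

```python
import numpy as np

def attention_order(attn: np.ndarray) -> np.ndarray:
    """Return token positions sorted by descending attention column sum.

    attn: (seq_len, seq_len) attention matrix; rows are queries, columns
    are keys. A large column sum means many positions attend to that
    token, which serves here as a proxy for decoding priority.
    """
    col_sums = attn.sum(axis=0)   # total attention received per token
    return np.argsort(-col_sums)  # positions, highest column sum first

# Toy 4-token attention matrix (rows sum to 1, as after softmax).
attn = np.array([
    [0.1, 0.6, 0.2, 0.1],
    [0.2, 0.5, 0.2, 0.1],
    [0.3, 0.3, 0.3, 0.1],
    [0.1, 0.4, 0.3, 0.2],
])
print(attention_order(attn))  # → [1 2 0 3]
```

Here token 1 receives the most total attention (column sum 1.8), so it would be decoded first; token 3 (column sum 0.5) would be decoded last.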

Abstract

Auto-regressive models (ARMs) have established a dominant paradigm in language modeling. However, their strictly sequential decoding paradigm imposes fundamental constraints on both inference efficiency and modeling flexibility. To address these limitations, diffusion-based large language models (dLLMs) have been proposed, offering the potential for parallel decoding and flexible language modeling. Despite these advantages, current dLLM decoding strategies rely primarily on token-level information, which fails to account for global sequence structure and often yields suboptimal results. In this paper, we study the decoding order selection problem from the perspective of log-likelihood maximization. We theoretically demonstrate that optimal sequence likelihood can be approximately achieved by decoding tokens in descending order of their attention-matrix column sums. This finding provides a principled justification for attention-guided decoding and offers a theoretically grounded alternative to greedy search. We instantiate this theoretical insight in a new training-free decoding algorithm, termed Attn-Sampler, and further propose a block attention approximation and dynamic attention thresholding for practical acceleration. Extensive experiments across multiple benchmarks validate the effectiveness of our proposed method, demonstrating that it achieves superior generation quality while enhancing decoding parallelism.
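The dynamic-thresholding idea mentioned above can be illustrated with a small sketch. All names here are hypothetical, and the paper's actual thresholding rule may differ: the assumption is that, instead of unmasking one token per step, every still-masked position whose attention column sum is within a factor `tau` of the current maximum is decoded in the same step, trading a strict ordering for parallelism.

```python
import numpy as np

def select_parallel(attn: np.ndarray, masked: list[int], tau: float) -> list[int]:
    """Pick masked positions to decode together in one step.

    attn: (seq_len, seq_len) attention matrix; masked: indices still to
    be decoded; tau in (0, 1]: positions whose column sum is at least
    tau times the best masked column sum are unmasked together.
    (tau is an illustrative knob, not a parameter from the paper.)
    """
    col_sums = attn.sum(axis=0)                 # attention received per token
    best = max(col_sums[i] for i in masked)     # strongest masked position
    return [i for i in masked if col_sums[i] >= tau * best]

attn = np.array([
    [0.1, 0.6, 0.2, 0.1],
    [0.2, 0.5, 0.2, 0.1],
    [0.3, 0.3, 0.3, 0.1],
    [0.1, 0.4, 0.3, 0.2],
])
print(select_parallel(attn, masked=[0, 1, 2, 3], tau=0.5))  # → [1, 2]
```

With `tau = 1.0` this degenerates to one token per step (pure ordered decoding); lowering `tau` widens each step and increases parallelism at some risk to quality, which matches the efficiency/quality trade-off the abstract describes.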