Early Decisions Matter: Proximity Bias and Initial Trajectory Shaping in Non-Autoregressive Diffusion Language Models

arXiv cs.CL / April 14, 2026


Key Points

  • The paper studies non-autoregressive decoding in diffusion-based language models by analyzing inference dynamics along the temporal (diffusion-time) axis to understand why decoding can fail on reasoning and planning tasks.
  • It identifies a failure mode driven by “proximity bias,” where denoising tends to focus on spatially adjacent tokens, causing spatial error propagation and making the generation trajectory overly dependent on the initial unmasking position.
  • To mitigate this, the authors propose a minimal-intervention method that improves early token selection using a lightweight planner and end-of-sequence temperature annealing.
  • Experiments on multiple reasoning and planning benchmarks show substantial improvements over existing heuristic baselines while adding little to no computational overhead.
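The proximity-bias failure mode described above can be illustrated with a toy simulation of confidence-based unmasking. This is not the paper's actual decoder: the `locality` boost, the uniform stand-in for model confidence, and all function names here are illustrative assumptions used only to show how a preference for neighbors of already-unmasked tokens makes the whole decoding order cluster around the first unmasked position.

```python
import random

def confidence_based_order(length, locality=0.8, seed=0):
    """Toy simulation of confidence-based unmasking (illustrative only).

    At each step we unmask the masked position with the highest score.
    Proximity bias is modeled by boosting positions adjacent to an
    already-unmasked token by `locality` (an assumed parameter).
    """
    rng = random.Random(seed)
    masked = set(range(length))
    unmasked = set()
    order = []
    while masked:
        def score(i):
            base = rng.random()  # stand-in for the model's confidence
            near = any(abs(i - j) == 1 for j in unmasked)
            return base + (locality if near else 0.0)
        pick = max(masked, key=score)
        masked.remove(pick)
        unmasked.add(pick)
        order.append(pick)
    return order

order = confidence_based_order(10)
# Count how often consecutive picks are spatially adjacent: with a strong
# locality boost, the trajectory tends to grow outward from the first pick,
# so the initial unmasking position largely determines the whole order.
adjacent_steps = sum(abs(a - b) == 1 for a, b in zip(order, order[1:]))
print(order, adjacent_steps)
```

Running this with a large `locality` versus `locality=0.0` makes the spatial clustering (and hence the dependence on the first position) directly visible.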

Abstract

Diffusion-based language models (dLLMs) have emerged as a promising alternative to autoregressive language models, offering the potential for parallel token generation and bidirectional context modeling. However, harnessing this flexibility for fully non-autoregressive decoding remains an open question, particularly for reasoning and planning tasks. In this work, we investigate non-autoregressive decoding in dLLMs by systematically analyzing its inference dynamics along the temporal axis. Specifically, we uncover an inherent failure mode in confidence-based non-autoregressive generation stemming from a strong proximity bias: the tendency for the denoising order to concentrate on spatially adjacent tokens. This local dependency leads to spatial error propagation, rendering the entire trajectory critically contingent on the initial unmasking position. Leveraging this insight, we present a minimal-intervention approach that guides early token selection, employing a lightweight planner and end-of-sequence temperature annealing. We thoroughly evaluate our method on various reasoning and planning tasks and observe substantial overall improvement over existing heuristic baselines without significant computational overhead.
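The two interventions named in the abstract can be sketched in a few lines. The abstract does not specify the planner's scoring rule or the annealing schedule, so everything below is a hedged assumption: a linear schedule for the end-of-sequence (EOS) temperature, and a planner that simply blends an external score with the model's per-position confidence to pick where to unmask first.

```python
def eos_temperature(step, total_steps, t_start=1.0, t_end=0.1):
    """Hypothetical EOS temperature annealing: a linear schedule that
    starts hot (discouraging premature, overconfident termination) and
    cools as decoding progresses. The linear shape is an assumption."""
    frac = step / max(total_steps - 1, 1)
    return t_start + (t_end - t_start) * frac

def plan_initial_position(token_confidences, planner_scores, alpha=0.5):
    """Hypothetical lightweight planner: blend a planner score with the
    model's raw per-position confidence and unmask the argmax first,
    rather than trusting raw confidence alone."""
    blended = [alpha * p + (1 - alpha) * c
               for c, p in zip(token_confidences, planner_scores)]
    return max(range(len(blended)), key=blended.__getitem__)

temps = [round(eos_temperature(s, 5), 2) for s in range(5)]
start = plan_initial_position([0.9, 0.2, 0.4], [0.1, 0.9, 0.3])
print(temps, start)
```

In this toy setup the planner overrides the raw confidence ranking (position 0 has the highest confidence, but the blended score favors position 1), which is the kind of early-token steering the abstract describes; the actual planner in the paper may score positions very differently.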