On the Trainability of Masked Diffusion Language Models via Blockwise Locality

arXiv cs.LG / April 29, 2026


Key Points

  • The paper evaluates masked diffusion language models (MDMs) against standard autoregressive LLMs on three structured tasks (in-context linear regression, graph path-finding, and Sudoku solving) and finds markedly different learning stability across them.
  • Results show that random-masking blockwise MDMs struggle to reliably learn linear regression and exhibit high-variance training dynamics on graph path-finding, yet can outperform autoregressive models on Sudoku.
  • To address these instabilities, the authors introduce two locality-aware blockwise designs (Jigsaw and Scatter) that add left-to-right inductive bias within blocks while still using diffusion-style iterative refinement at the block level; see the decoding sketch after this list.
  • Empirically, Jigsaw improves stability to match autoregressive performance on linear regression while staying strong on Sudoku, and Scatter preserves diffusion’s planning advantage on path-finding.
  • The findings suggest that simply using random masking is a suboptimal way to instantiate diffusion LMs for ordered generation, motivating better locality-aware or non-random masking approaches.
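To make the within-block ordering contrast concrete, here is a minimal toy sketch (not the paper's code) of blockwise unmasking. It compares the confidence-based reveal order typical of random-masking MDMs with a leftmost-first order that enforces autoregressive locality inside each block, in the spirit of Jigsaw. `toy_model`, `MASK_ID`, and the exact unmasking rules are illustrative assumptions, not details from the paper.

```python
import torch

MASK_ID = 0        # hypothetical id reserved for the [MASK] token
VOCAB_SIZE = 16    # toy vocabulary for the demo

def toy_model(seq: torch.Tensor) -> torch.Tensor:
    """Stand-in for a trained MDM: random per-position logits.
    A real model would condition on the partially masked sequence."""
    return torch.randn(seq.size(0), VOCAB_SIZE)

def decode_block(seq: torch.Tensor, start: int, end: int, left_to_right: bool) -> None:
    """Unmask one position per refinement step inside seq[start:end]."""
    while (seq[start:end] == MASK_ID).any():
        logits = toy_model(seq)
        masked = torch.nonzero(seq[start:end] == MASK_ID).flatten() + start
        if left_to_right:
            # locality-aware order (our reading of Jigsaw): reveal the
            # leftmost masked slot first, mimicking autoregression in-block
            pos = int(masked[0])
        else:
            # standard MDM order: reveal the most confident masked slot
            conf = logits[masked].max(dim=-1).values
            pos = int(masked[conf.argmax()])
        # never emit the mask id itself (skip logit 0 = MASK_ID)
        seq[pos] = 1 + int(logits[pos, 1:].argmax())

def decode(prompt_len: int, num_blocks: int, block_size: int,
           left_to_right: bool) -> torch.Tensor:
    seq = torch.full((prompt_len + num_blocks * block_size,), MASK_ID, dtype=torch.long)
    seq[:prompt_len] = torch.randint(1, VOCAB_SIZE, (prompt_len,))  # fake prompt
    for b in range(num_blocks):                # blocks are decoded in order
        start = prompt_len + b * block_size
        decode_block(seq, start, start + block_size, left_to_right)
    return seq

print(decode(prompt_len=4, num_blocks=3, block_size=4, left_to_right=True))
```

In both variants the blocks are generated sequentially and each block is refined over multiple steps; only the reveal order inside a block changes, which is exactly the inductive bias the locality-aware designs target.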

Abstract

Masked diffusion language models (MDMs) have recently emerged as a promising alternative to standard autoregressive large language models (AR-LLMs), yet their optimization can be substantially less stable. We study blockwise MDMs and compare them with AR-LLMs on three controlled tasks that stress different aspects of structured generation: in-context linear regression, graph path-finding, and Sudoku solving. We find that standard random-masking MDMs fail to reliably learn linear regression and exhibit high-variance training dynamics on graph path-finding, yet outperform AR-LLMs on Sudoku. To mitigate these instabilities, we propose two locality-aware blockwise models, Jigsaw and Scatter, that inject left-to-right inductive bias by enforcing autoregressive locality within blocks while preserving iterative refinement at the block level. Empirically, Jigsaw matches AR-LLM stability on linear regression and remains strong on Sudoku, while Scatter retains diffusion's planning advantage on path-finding. Our results indicate that standard random-masking MDMs, even in blockwise variants, may be a suboptimal instantiation of diffusion LMs for ordered generation, motivating models beyond random masking.
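On the training side, the distinction can be sketched as a difference in how sequences are corrupted. Below, `random_masking` is the standard i.i.d. corruption used by random-masking MDMs, while `suffix_block_masking` is one plausible reading of "autoregressive locality within blocks": mask a contiguous suffix of each block so the surviving context is always a left prefix. The suffix rule, the function names, and `MASK_ID` are our own illustrative assumptions; the paper's Jigsaw and Scatter schemes may differ in detail.

```python
import torch

MASK_ID = 0  # hypothetical mask-token id

def random_masking(tokens: torch.Tensor, rate: float) -> torch.Tensor:
    """Standard MDM corruption: every position is masked i.i.d. at `rate`,
    so the model must learn to infill in arbitrary order."""
    drop = torch.rand(tokens.shape) < rate
    return torch.where(drop, torch.full_like(tokens, MASK_ID), tokens)

def suffix_block_masking(tokens: torch.Tensor, block_size: int, rate: float) -> torch.Tensor:
    """Hypothetical locality-aware corruption: mask a contiguous suffix of
    each block, so the surviving context is always a left-to-right prefix.
    This injects autoregressive bias in-block while keeping blockwise
    diffusion-style refinement intact."""
    out = tokens.clone()
    for start in range(0, tokens.size(0), block_size):
        end = min(start + block_size, tokens.size(0))
        keep = int((1.0 - rate) * (end - start))  # prefix length to keep
        out[start + keep:end] = MASK_ID
    return out

x = torch.arange(1, 13)  # toy 12-token sequence (ids 1..12)
print(random_masking(x, rate=0.5))
print(suffix_block_masking(x, block_size=4, rate=0.5))
```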