Remask, Don't Replace: Token-to-Mask Refinement in Masked Diffusion Language Models

arXiv cs.CL / April 22, 2026

📰 News · Models & Research

Key Points

  • The paper analyzes masked diffusion language models that use Token-to-Token (T2T) editing to overwrite confidently wrong tokens, and identifies three structural failure modes in this rule.
  • It proposes Token-to-Mask (T2M) remasking, which resets a suspect position back to the mask state so the model can re-predict it during the next denoising step using an in-distribution context.
  • T2M is training-free, changes only the editing procedure, adds no new parameters, and includes three detection heuristics to decide when to trigger remasking.
  • Experiments across eight benchmarks show T2M improves exact token-level accuracy; the largest gain is +5.92 points on CMATH, achieved by repairing a substantial share of “last-mile” corrupted final answers.
  • The authors also provide a theoretical rationale for why a mask conditioning signal is more effective than conditioning on an erroneous committed token.

Abstract

Masked diffusion language models such as LLaDA2.1 rely on Token-to-Token (T2T) editing to correct their own generation errors: whenever a different token crosses a confidence threshold, the committed token is overwritten. We identify three structural failure modes of this rule. The trigger cannot fire when no single alternative is confident enough; the replacement is computed under a context that may itself contain errors; and the uniform perturbations used to train the T2T stream do not resemble the coherent, semantically plausible mistakes that the model actually makes at inference. As an alternative, we propose Token-to-Mask (T2M) remasking. Rather than overwriting a suspect token with a new guess, T2M resets the position to the mask state, so that the next denoising step re-predicts it from an in-distribution context. The method is training-free, modifies only the editing rule, and introduces no new parameters. We pair it with three detection heuristics and give a short theoretical account of why a mask is a better conditioning signal than an erroneous token. Across 8 benchmarks, T2M improves accuracy on tasks that require exact token-level output. Its largest gain is +5.92 points on CMATH, where we attribute 79.9% of baseline errors to last-mile corruption (correct reasoning followed by a garbled final answer); T2M repairs 41.3% of these cases.
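The contrast between the two editing rules can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the `MASK` sentinel, the confidence threshold, and the suspicion signal are all assumptions made for clarity.

```python
# Toy contrast between T2T overwriting and T2M remasking at one position.
# MASK, the threshold value, and the detection signal are illustrative
# assumptions, not the paper's actual interfaces.

MASK = "<mask>"

def t2t_edit(tokens, position, alt_token, alt_confidence, threshold=0.9):
    """Token-to-Token: overwrite only if a single alternative token is
    confident enough. If nothing crosses the threshold, the trigger
    cannot fire and the suspect token stays committed."""
    if alt_confidence >= threshold:
        tokens[position] = alt_token  # commit the replacement guess now
    return tokens

def t2m_remask(tokens, position, is_suspect):
    """Token-to-Mask: reset the suspect position to the mask state, so
    the next denoising step re-predicts it from an in-distribution
    (mask-conditioned) context instead of an erroneous committed token."""
    if is_suspect:
        tokens[position] = MASK
    return tokens

seq = ["The", "answer", "is", "7"]
# T2T cannot act when no alternative is confident enough:
t2t_edit(seq, 3, "5", alt_confidence=0.6)  # seq unchanged
# T2M needs only a suspicion signal, not a confident replacement:
t2m_remask(seq, 3, is_suspect=True)        # seq[3] becomes "<mask>"
```

Note the asymmetry the paper exploits: T2T must already know the right answer to act, whereas T2M only needs to detect that something is wrong and defers the re-prediction to the model's ordinary denoising step.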