Remask, Don't Replace: Token-to-Mask Refinement in Masked Diffusion Language Models

arXiv cs.CL / April 22, 2026

📰 News · Models & Research

Key Points

  • The paper analyzes masked diffusion language models that use Token-to-Token (T2T) editing to overwrite confidently wrong tokens, and identifies three structural failure modes in this rule.
  • It proposes Token-to-Mask (T2M) remasking, which resets a suspect position back to the mask state so the model can re-predict it during the next denoising step using an in-distribution context.
  • T2M is training-free, changes only the editing procedure, adds no new parameters, and includes three detection heuristics to decide when to trigger remasking.
  • Experiments across eight benchmarks show T2M improves exact token-level accuracy; the largest gain is +5.92 points on CMATH, achieved by repairing a substantial share of “last-mile” corrupted final answers.
  • The authors also provide a theoretical rationale for why a mask conditioning signal is more effective than conditioning on an erroneous committed token.

Abstract

Masked diffusion language models such as LLaDA2.1 rely on Token-to-Token (T2T) editing to correct their own generation errors: whenever a different token crosses a confidence threshold, the committed token is overwritten. We identify three structural failure modes of this rule. The trigger cannot fire when no single alternative is confident enough; the replacement is computed under a context that may itself contain errors; and the uniform perturbations used to train the T2T stream do not resemble the coherent, semantically plausible mistakes that the model actually makes at inference. As an alternative, we propose Token-to-Mask (T2M) remasking. Rather than overwriting a suspect token with a new guess, T2M resets the position to the mask state, so that the next denoising step re-predicts it from an in-distribution context. The method is training-free, modifies only the editing rule, and introduces no new parameters. We pair it with three detection heuristics and give a short theoretical account of why a mask is a better conditioning signal than an erroneous token. Across 8 benchmarks, T2M improves accuracy on tasks that require exact token-level output. Its largest gain is +5.92 points on CMATH, where we attribute 79.9% of baseline errors to last-mile corruption (correct reasoning followed by a garbled final answer); T2M repairs 41.3% of these cases.
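The contrast between the two editing rules can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the `MASK` sentinel, the confidence threshold, and the suspicion signal are all assumptions made for clarity.

```python
# Toy contrast between T2T overwriting and T2M remasking at one position.
# MASK, the threshold value, and the detection signal are illustrative
# assumptions, not the paper's actual interfaces.

MASK = "<mask>"

def t2t_edit(tokens, position, alt_token, alt_confidence, threshold=0.9):
    """Token-to-Token: overwrite only if a single alternative token is
    confident enough. If nothing crosses the threshold, the trigger
    cannot fire and the suspect token stays committed."""
    if alt_confidence >= threshold:
        tokens[position] = alt_token  # commit the replacement guess now
    return tokens

def t2m_remask(tokens, position, is_suspect):
    """Token-to-Mask: reset the suspect position to the mask state, so
    the next denoising step re-predicts it from an in-distribution
    (mask-conditioned) context instead of an erroneous committed token."""
    if is_suspect:
        tokens[position] = MASK
    return tokens

seq = ["The", "answer", "is", "7"]
# T2T cannot act when no alternative is confident enough:
t2t_edit(seq, 3, "5", alt_confidence=0.6)  # seq unchanged
# T2M needs only a suspicion signal, not a confident replacement:
t2m_remask(seq, 3, is_suspect=True)        # seq[3] becomes "<mask>"
```

Note the asymmetry the paper exploits: T2T must already know the right answer to act, whereas T2M only needs to detect that something is wrong and defers the re-prediction to the model's ordinary denoising step.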