Not All Denoising Steps Are Equal: Model Scheduling for Faster Masked Diffusion Language Models
arXiv cs.LG / 2026/4/6
Key Points
- The paper studies masked diffusion language models (MDLMs), focusing on speeding up sampling that currently requires many full-sequence denoising passes through a large Transformer.
- It proposes “model scheduling,” using a smaller MDLM to replace the full model at selected denoising steps to reduce compute while preserving quality.
- Experiments on OpenWebText show early and late denoising steps are more robust to small-model replacement than middle steps, enabling up to a 17% FLOPs reduction with only modest loss in generative perplexity.
- The authors back these results with step-importance analyses (loss and KL divergence across timesteps) and an exhaustive search over coarse step segments, concluding the middle of the diffusion trajectory is most sensitive.
- Overall, the work suggests architecture-agnostic scheduling rules can accelerate MDLM inference without substantially harming generation quality as measured by perplexity.
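The scheduling idea described above can be sketched as a simple routing rule inside the iterative unmasking loop. This is a minimal illustration, not the paper's implementation: the `large_model`/`small_model` stand-ins, the token budget per step, and the `small_steps` set are all hypothetical placeholders; in the paper the schedule would keep the full model on the sensitive middle steps.

```python
import random

MASK = -1          # sentinel for a masked token position
VOCAB = list(range(10))

def large_model(tokens):
    # Stand-in for the full MDLM: proposes a token for every masked position.
    return [random.choice(VOCAB) if t == MASK else t for t in tokens]

def small_model(tokens):
    # Stand-in for the cheaper MDLM substituted at scheduled steps.
    return [random.choice(VOCAB) if t == MASK else t for t in tokens]

def sample(seq_len=16, num_steps=8, small_steps=frozenset()):
    """Iteratively unmask a fully masked sequence over `num_steps` passes.
    At steps listed in `small_steps` the denoising pass is routed to the
    small model; all other steps use the large model."""
    tokens = [MASK] * seq_len
    order = list(range(seq_len))
    random.shuffle(order)                 # random unmasking order
    per_step = max(1, seq_len // num_steps)
    for step in range(num_steps):
        model = small_model if step in small_steps else large_model
        preds = model(tokens)
        # Commit predictions for a fixed budget of still-masked positions.
        for pos in order[:per_step]:
            tokens[pos] = preds[pos]
        order = order[per_step:]
    # Commit any leftover masked positions with one final large-model pass.
    preds = large_model(tokens)
    return [p if t == MASK else t for t, p in zip(tokens, preds)]
```

Routing early and late steps to the small model (e.g. `sample(small_steps={0, 1, 6, 7})`) mirrors the schedule the experiments found most robust.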
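The step-importance analysis can likewise be sketched: compare the full model's and the small model's per-position token distributions at each timestep via KL divergence, and keep the full model wherever the gap is large. The distributions and helper names below are illustrative assumptions, not the paper's exact metric.

```python
import math

def kl(p, q, eps=1e-12):
    # KL(p || q) for two discrete distributions over the same vocabulary.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def step_sensitivity(large_dists, small_dists):
    """Mean per-position KL between full-model and small-model predictive
    distributions, one value per timestep. High-KL timesteps are poor
    candidates for small-model replacement."""
    return [
        sum(kl(p, q) for p, q in zip(step_l, step_s)) / len(step_l)
        for step_l, step_s in zip(large_dists, small_dists)
    ]
```

On toy inputs, a timestep where the two models agree scores near zero, while a disagreeing timestep scores high, reproducing the "replace only where KL is small" rule.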