Mask Is What DLLM Needs: A Masked Data Training Paradigm for Diffusion LLMs
arXiv cs.LG / 3/18/2026
Key Points
- The paper proposes an Information Density Driven Smart Noise Scheduler for diffusion language models to address non-uniform information density in real-world sequences.
- It introduces Complementary Priority Masking to decouple a training instance into mutually reinforcing reasoning and syntax samples, enabling the model to master both logical deduction and foundational sequence structure.
- Experiments show an average accuracy improvement of roughly 4% over uniform baselines across four code and math reasoning benchmarks.
- Mechanistic analyses reveal that probabilistic priority masking mitigates contextual collapse during block diffusion training.
- The processed dataset is available at https://huggingface.co/datasets/malr07/opc-sft-stage2-dense-extracted.
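
To make the idea of complementary masking concrete, here is a minimal sketch of one way a single sequence could be split into two training samples whose masks are exact complements. This is not the paper's implementation: the summary does not describe how information density is computed or how the scheduler sets its probabilities, so the per-token `density` scores, the `mask_ratio` parameter, the `[MASK]` token string, and both function names are assumptions made for illustration. Tokens with higher density are more likely (not guaranteed) to be masked in the "reasoning" sample, and everything left visible there is masked in the "syntax" sample.

```python
import random

MASK = "[MASK]"  # hypothetical mask token string


def complementary_masks(density, mask_ratio=0.5, seed=0):
    """Return (reasoning_mask, syntax_mask) as two lists of bools.

    density: assumed per-token information-density scores (higher = denser).
    The reasoning mask favors dense tokens probabilistically; the syntax
    mask is its exact complement.
    """
    rng = random.Random(seed)
    n = len(density)
    k = round(mask_ratio * n)  # number of tokens masked in the reasoning sample

    # Weighted sampling without replacement (Efraimidis-Spirakis keys):
    # each token gets key u ** (1 / weight); the k largest keys are selected,
    # so dense tokens are prioritized without being chosen deterministically.
    keys = [(rng.random() ** (1.0 / max(d, 1e-9)), i) for i, d in enumerate(density)]
    chosen = {i for _, i in sorted(keys, reverse=True)[:k]}

    reasoning = [i in chosen for i in range(n)]
    syntax = [not m for m in reasoning]
    return reasoning, syntax


def apply_mask(tokens, mask):
    """Replace every token whose mask entry is True with the mask token."""
    return [MASK if m else t for t, m in zip(tokens, mask)]


if __name__ == "__main__":
    tokens = ["def", "f", "(", "x", ")", ":", "return", "x", "*", "2"]
    density = [0.2, 0.9, 0.1, 0.8, 0.1, 0.1, 0.3, 0.8, 0.9, 0.9]  # made-up scores
    reasoning, syntax = complementary_masks(density)
    print(apply_mask(tokens, reasoning))  # dense tokens tend to be hidden
    print(apply_mask(tokens, syntax))     # the remaining tokens are hidden
```

Because the two masks are complements, every token is a prediction target in exactly one of the two samples, which is one plausible reading of "decoupling a training instance" into reasoning and syntax samples.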




