Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed
arXiv cs.CL / 5/1/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- Diffusion language models enable parallel, non-autoregressive generation, but they often learn less efficiently than autoregressive models when trained from scratch.
- The paper proposes an AR-to-dLM conversion approach that targets limitations in existing methods’ attention patterns and training objectives to preserve task accuracy while improving speed.
- It finds that keeping the pretrained AR weight distributions intact is crucial for effective conversion, and introduces a continuous pretraining scheme with a block-wise attention pattern that is causal across blocks yet bidirectional within each block (see the attention-mask sketch after this list).
- To reduce a mismatch between training-time and test-time mask token distributions, the authors introduce a position-dependent token masking strategy that masks later tokens more heavily during training (see the masking sketch after this list).
- Experiments yield the "Efficient-DLM" model family; Efficient-DLM 8B reportedly achieves higher accuracy than prior AR and dLM baselines while delivering substantially higher throughput (e.g., 4.5x and 2.7x faster than Dream 7B and Qwen3 4B, respectively).
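To make the block-wise attention pattern concrete, here is a minimal sketch of how such a mask could be built in PyTorch: queries attend bidirectionally within their own block and causally to all earlier blocks. The function name, block size, and implementation are illustrative assumptions, not taken from the paper.

```python
import torch

def block_causal_mask(seq_len: int, block_size: int) -> torch.Tensor:
    """Boolean attention mask: causal across blocks, bidirectional within a block.

    True means the query position may attend to the key position.
    `seq_len` and `block_size` are illustrative; the paper's actual
    block size and implementation may differ.
    """
    # Block index of every position, e.g. [0, 0, 0, 0, 1, 1, 1, 1] for block_size=4.
    block_ids = torch.arange(seq_len) // block_size
    # A query in block i may attend to any key in blocks 0..i (inclusive):
    # full bidirectional attention inside its own block, causal attention
    # to all earlier blocks, and no attention to later blocks.
    return block_ids.unsqueeze(1) >= block_ids.unsqueeze(0)

if __name__ == "__main__":
    print(block_causal_mask(seq_len=8, block_size=4).int())
```

In practice such a mask would be passed to the attention layers during the continuous-pretraining stage, so generation can proceed block by block while tokens inside a block are denoised in parallel.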
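The position-dependent masking idea can likewise be sketched as follows: later positions are masked with higher probability than earlier ones during training. The linear ramp between `min_rate` and `max_rate`, and all names and defaults, are assumptions for illustration; the paper's exact masking schedule may differ.

```python
import torch

def position_dependent_mask(input_ids: torch.Tensor,
                            mask_token_id: int,
                            min_rate: float = 0.1,
                            max_rate: float = 0.9) -> torch.Tensor:
    """Mask later positions more heavily than earlier ones.

    `min_rate`/`max_rate` define an illustrative linear ramp of per-position
    masking probabilities; this is an assumed schedule, not the paper's.
    """
    batch, seq_len = input_ids.shape
    # Per-position masking probability, increasing from left to right.
    rates = torch.linspace(min_rate, max_rate, seq_len)
    # Sample a Bernoulli mask per token; later positions are masked more often.
    mask = torch.rand(batch, seq_len) < rates
    masked_ids = input_ids.clone()
    masked_ids[mask] = mask_token_id
    return masked_ids
```

The intuition is that, at inference, later tokens spend more steps in the masked state than earlier ones, so biasing the training-time mask distribution toward later positions narrows that train-test gap.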