Autoregressive vs. Masked Diffusion Language Models: A Controlled Comparison

arXiv cs.CL · March 24, 2026


Key Points

  • The paper provides a controlled empirical comparison of autoregressive (AR) versus masked diffusion language models (MDLM) by holding data, compute, sequence length, and hardware constant while varying only the generation paradigm.
  • It finds similar training throughput for both approaches, with MDLM taking only about 4.7% more wall-clock time, indicating no major efficiency disadvantage in training speed.
  • The study reports different convergence and overfitting behaviors: AR converges faster but begins overfitting around step 14,000, while MDLM continues improving through step 20,000.
  • A diversity analysis over 1,000 generated samples reveals a structural diversity-fluency trade-off: AR outputs are more fluent but less diverse, whereas MDLM produces more diverse narratives at the cost of occasional grammatical inconsistencies.
  • The authors release code, trained checkpoints, and data pipelines to support reproducibility and further investigation.

Abstract

We present a controlled empirical comparison between autoregressive (AR) and masked diffusion (MDLM) language models. Both models are trained on identical data (50M tokens from TinyStories), identical compute budget (20,000 steps, batch size 32, sequence length 512), and identical hardware (NVIDIA H100 80GB), isolating the generation paradigm as the sole variable. We report three findings. First, both paradigms achieve comparable training throughput (~50K tokens/second), with MDLM requiring only 4.7% more wall-clock time. Second, AR converges faster and begins overfitting by step 14,000, while MDLM converges more slowly and is still improving at step 20,000, suggesting different compute-optimal training regimes. Third, quantitative diversity analysis over 1,000 generated samples reveals a structural diversity-fluency trade-off: AR produces fluent but repetitive outputs (99.8% begin with the same word), while MDLM generates more diverse narratives (93.4% unique 5-word openings, higher Distinct-n, lower Self-BLEU), at the cost of occasional grammatical inconsistencies. All code, trained checkpoints, and data pipelines are released for reproducibility.
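The diversity metrics cited in the abstract (Distinct-n and Self-BLEU) can be illustrated with a minimal sketch. The functions below are simplified stand-ins, not the paper's released evaluation code: `bleu` computes only clipped n-gram precisions up to bigrams with a brevity penalty, omitting the smoothing a full implementation would use. Higher Distinct-n means more unique n-grams across samples; lower Self-BLEU means each sample overlaps less with the others, i.e. more diversity.

```python
from collections import Counter
from math import exp, log

def _ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def distinct_n(samples, n):
    """Distinct-n: unique n-grams divided by total n-grams across all samples."""
    counts = Counter()
    for tokens in samples:
        counts.update(_ngram_counts(tokens, n))
    total = sum(counts.values())
    return len(counts) / total if total else 0.0

def bleu(candidate, references, max_n=2):
    """Minimal BLEU: geometric mean of clipped n-gram precisions
    (n = 1..max_n) times a brevity penalty. No smoothing."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = _ngram_counts(candidate, n)
        if not cand_counts:
            return 0.0
        # Clip each candidate n-gram count by its max count in any reference.
        max_ref = Counter()
        for ref in references:
            for gram, c in _ngram_counts(ref, n).items():
                max_ref[gram] = max(max_ref[gram], c)
        clipped = sum(min(c, max_ref[gram]) for gram, c in cand_counts.items())
        precisions.append(clipped / sum(cand_counts.values()))
    if min(precisions) == 0:
        return 0.0
    geo_mean = exp(sum(log(p) for p in precisions) / max_n)
    # Brevity penalty against the reference length closest to the candidate's.
    ref_len = min((len(r) for r in references),
                  key=lambda L: (abs(L - len(candidate)), L))
    bp = 1.0 if len(candidate) >= ref_len else exp(1 - ref_len / len(candidate))
    return bp * geo_mean

def self_bleu(samples, max_n=2):
    """Self-BLEU: average BLEU of each sample against all the others."""
    scores = [bleu(s, samples[:i] + samples[i + 1:], max_n)
              for i, s in enumerate(samples)]
    return sum(scores) / len(scores)
```

On a degenerate set of identical samples, Distinct-1 is low and Self-BLEU is 1.0; on fully disjoint samples, Distinct-1 is 1.0 and Self-BLEU is 0.0, which is the direction of the AR-versus-MDLM contrast the paper reports.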