Generative Frontiers: Why Evaluation Matters for Diffusion Language Models
arXiv cs.LG / 2026/4/6
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The paper argues that diffusion language models’ greater flexibility has outpaced current evaluation practices, creating new challenges for making reliable model comparisons.
- It critiques the common reliance on OpenWebText as a benchmark and explains why alternatives like LM1B may be inherently less meaningful for evaluation.
- The authors show that likelihood-based evaluation is limited for diffusion models, and that “generative perplexity” by itself can be uninformative because it is confounded by sample entropy: low-entropy, repetitive generations can score well regardless of quality.
- By decomposing (log) generative perplexity into the samples’ entropy plus a KL divergence to a reference distribution, the paper motivates a more principled evaluation approach called “generative frontiers” (see the identity sketched after this list).
- The work reports empirical observations at GPT-2-small scale (≈150M parameters) and is accompanied by interactive blog materials that demonstrate the methodology.
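
The decomposition these bullets refer to is, at its core, the standard cross-entropy identity. A minimal sketch, assuming q is the generator's distribution over samples and p_ref is a reference model used to score them (the notation is ours, not necessarily the paper's):

```latex
% Cross-entropy identity underlying generative perplexity:
% scoring samples from q under p_ref mixes two effects.
\[
H(q, p_{\mathrm{ref}})
  = \underbrace{H(q)}_{\text{sample entropy}}
  + \underbrace{D_{\mathrm{KL}}\!\left(q \,\|\, p_{\mathrm{ref}}\right)}_{\text{divergence from reference}},
\qquad
\mathrm{GenPPL}(q) = \exp\!\bigl(H(q, p_{\mathrm{ref}})\bigr)
\]
```

On this reading, low generative perplexity can be achieved either by matching the reference (small KL) or by collapsing entropy (repetitive samples), which is why tracing the two components separately, as a frontier, is more informative than the scalar alone.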
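A small Monte Carlo sketch of the same bookkeeping, assuming we already have per-token log-probabilities of generated tokens under both models; the arrays here are random stand-ins for real sample statistics, not the paper's data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: per-token log-probs of sampled tokens under the
# generator q and under a reference model p_ref. In practice these would come
# from the sampler's own likelihoods and from rescoring samples with an oracle LM.
log_q = rng.uniform(-6.0, -1.0, size=10_000)
log_p_ref = log_q + rng.normal(-0.5, 0.3, size=10_000)

entropy_q = -log_q.mean()       # Monte Carlo estimate of H(q)
cross_ent = -log_p_ref.mean()   # H(q, p_ref): what generative perplexity exponentiates
kl_gap = cross_ent - entropy_q  # KL(q || p_ref), by the identity above

gen_ppl = np.exp(cross_ent)
print(f"gen-PPL = {gen_ppl:.1f}, H(q) = {entropy_q:.2f}, KL(q||p_ref) = {kl_gap:.2f}")
```

Varying the sampler's settings (e.g., temperature or number of diffusion steps) and plotting the resulting (entropy, KL) pairs is one way to visualize the kind of frontier the paper describes.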