Diagnosing and Improving Diffusion Models by Estimating the Optimal Loss Value

arXiv stat.ML / 4/17/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper argues that diffusion model loss can’t be used reliably as a measure of absolute data fit because the optimal (best-achievable) loss is usually non-zero and unknown.
  • It derives the optimal loss for diffusion models in closed form under a unified formulation and proposes practical estimators, including a stochastic version that scales to large datasets with controlled variance and bias.
  • The authors use the estimated optimal loss as a diagnostic metric to better assess training quality across mainstream diffusion model variants.
  • They also improve training schedules by optimizing with respect to the estimated optimal loss and report that, for 120M–1.5B parameter models, clearer power-law behavior emerges after subtracting the optimal loss from observed training loss, informing scaling-law studies.
  • Overall, the work provides a more principled framework for evaluating and comparing diffusion model training progress beyond raw loss values.
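To see why the optimal loss is typically nonzero, consider a standard ε-prediction denoising objective (a generic sketch; the paper's unified formulation may parameterize things differently):

$$
\mathcal{L}(\theta) = \mathbb{E}_{t,\, x_0,\, \epsilon}\; w(t)\, \big\| \epsilon_\theta(x_t, t) - \epsilon \big\|^2,
\qquad x_t = \alpha_t x_0 + \sigma_t \epsilon .
$$

The pointwise minimizer is the conditional mean $\epsilon^*(x_t, t) = \mathbb{E}[\epsilon \mid x_t]$, so the best achievable loss is the residual conditional variance,

$$
\mathcal{L}^* = \mathbb{E}_{t,\, x_t}\; w(t)\, \operatorname{tr} \mathrm{Cov}(\epsilon \mid x_t) > 0
$$

whenever the noise is not fully determined by the noisy observation. For an empirical (finite) data distribution this conditional expectation is a closed-form mixture over the dataset, which is what makes the optimal loss estimable in principle, though estimating it at scale requires the stochastic variants the paper develops.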

Abstract

Diffusion models have achieved remarkable success in generative modeling. Despite more stable training, the loss of diffusion models is not indicative of absolute data-fitting quality, since its optimal value is typically nonzero and unknown, making it difficult to distinguish a large optimal loss from insufficient model capacity. In this work, we advocate the need to estimate the optimal loss value for diagnosing and improving diffusion models. We first derive the optimal loss in closed form under a unified formulation of diffusion models, and develop effective estimators for it, including a stochastic variant scalable to large datasets with proper control of variance and bias. With this tool, we unlock the inherent metric for diagnosing the training quality of mainstream diffusion model variants, and develop a more performant training schedule based on the optimal loss. Moreover, using models with 120M to 1.5B parameters, we find that the power law is better demonstrated after subtracting the optimal loss from the actual training loss, suggesting a more principled setting for investigating the scaling law for diffusion models.
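The scaling-law claim can be illustrated with a toy numerical sketch. The data below are synthetic, not from the paper: we generate losses of the form L(N) = L_opt + a·N^(−b) and show that a log-log fit only recovers the true exponent after the floor L_opt is subtracted, which is the qualitative effect the authors report.

```python
import numpy as np

# Synthetic "training losses" following L(N) = L_opt + a * N**(-b).
# All constants here are made-up illustrations, not values from the paper.
N = np.array([120e6, 300e6, 700e6, 1.5e9])  # model sizes (parameters)
L_opt, a, b = 0.35, 50.0, 0.30              # assumed loss floor and power-law coefficients
loss = L_opt + a * N ** (-b)

# A log-log fit on the raw loss is bent by the nonzero floor L_opt;
# fitting log(loss - L_opt) against log(N) recovers the true exponent b.
slope_raw, _ = np.polyfit(np.log(N), np.log(loss), 1)
slope_sub, _ = np.polyfit(np.log(N), np.log(loss - L_opt), 1)

print(f"exponent without subtraction: {-slope_raw:.3f}")
print(f"exponent after subtraction:   {-slope_sub:.3f}")  # recovers b = 0.30
```

In practice L_opt is unknown, which is exactly why the paper's estimators matter: with an estimated floor in hand, the subtracted curve is the one to fit.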