Transformers Learn the Optimal DDPM Denoiser for Multi-Token GMMs

arXiv cs.LG / 4/14/2026


Key Points

  • The paper introduces the first convergence analysis for training transformer-based diffusion (DDPM) models, addressing why they can achieve effective score matching despite non-convex training dynamics.
  • It studies a population DDPM objective under data drawn from a multi-token Gaussian mixture distribution and derives theoretical requirements on token count per datapoint and the number of training iterations to reach global convergence.
  • The analysis quantifies convergence to the Bayes-optimal risk of the denoising objective, linking training progress to a desired score-matching error.
  • The authors find that a key role of the transformer’s self-attention is to realize a mean denoising mechanism approximating the oracle MMSE estimator of the diffusion-step noise.
  • Numerical experiments are reported to validate the theoretical convergence claims and the proposed interpretation of what self-attention is doing in denoising.
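The mean-denoising interpretation in the last two points can be made concrete in the simplest setting. Below is a minimal NumPy sketch (not the paper's construction; the mixture, noise level, and Dirac-prior simplification are assumptions for illustration) of the oracle MMSE denoiser when the clean sample is drawn uniformly from a set of mixture means. The posterior component weights take a softmax form, which is precisely the attention-like averaging the paper attributes to self-attention.

```python
import numpy as np

def mmse_denoiser(x_t, means, alpha_bar):
    """Oracle MMSE estimates of x0 and of the injected noise for a
    Dirac mixture prior x0 ~ Uniform{means}. Illustrative sketch only;
    the paper's multi-token GMM setting is more general.

    Forward process: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps.
    """
    # Squared distances from x_t to the noise-scaled component means.
    diffs = x_t - np.sqrt(alpha_bar) * means                  # (K, d)
    logits = -np.sum(diffs ** 2, axis=1) / (2 * (1 - alpha_bar))
    # Posterior component weights: a softmax, i.e. attention-like averaging.
    w = np.exp(logits - logits.max())
    w /= w.sum()
    x0_hat = w @ means                        # posterior mean E[x0 | x_t]
    eps_hat = (x_t - np.sqrt(alpha_bar) * x0_hat) / np.sqrt(1 - alpha_bar)
    return x0_hat, eps_hat, w

# Example: two well-separated means at low noise -> the denoiser
# concentrates its weights on the true component.
means = np.array([[3.0, 0.0], [-3.0, 0.0]])
alpha_bar = 0.99
rng = np.random.default_rng(0)
x0 = means[0]
x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1 - alpha_bar) * rng.standard_normal(2)
x0_hat, eps_hat, w = mmse_denoiser(x_t, means, alpha_bar)
```

At low noise the softmax saturates and the posterior mean nearly recovers the true component; at high noise the weights flatten and the estimate shrinks toward the mixture average, which is the expected behavior of a posterior-mean denoiser.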

Abstract

Transformer-based diffusion models have demonstrated remarkable performance at generating high-quality samples. However, our theoretical understanding of the reasons for this success remains limited. For instance, existing models are typically trained by minimizing a denoising objective, which is equivalent to fitting the score function of the training data. Yet we do not know why transformer-based models can match the score function for denoising, or why gradient-based methods converge to the optimal denoising model despite the non-convex loss landscape. To the best of our knowledge, this paper provides the first convergence analysis for training transformer-based diffusion models. More specifically, we consider the population Denoising Diffusion Probabilistic Model (DDPM) objective for denoising data that follow a multi-token Gaussian mixture distribution. We theoretically quantify the required number of tokens per data point and training iterations for the global convergence towards the Bayes optimal risk of the denoising objective, thereby achieving a desired score matching error. A deeper investigation reveals that the self-attention module of the trained transformer implements a mean denoising mechanism that enables the trained model to approximate the oracle Minimum Mean Squared Error (MMSE) estimator of the injected noise in the diffusion steps. Numerical experiments validate these findings.
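The denoising (epsilon-prediction) objective that the abstract refers to can be sketched as follows. This is a minimal NumPy illustration, not the paper's setup: the toy two-component mixture, the single noise level, and the linear stand-in for the transformer denoiser are all assumptions, and in this linear case the loss is in fact convex, unlike the transformer landscape the paper analyzes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy Gaussian-mixture data: one of two means plus unit Gaussian noise.
means = np.array([[3.0, 0.0], [-3.0, 0.0]])
d = means.shape[1]

def sample_x0(n):
    ks = rng.integers(0, len(means), size=n)
    return means[ks] + rng.standard_normal((n, d))

def ddpm_loss_and_grad(W, alpha_bar, n=4096):
    """Empirical DDPM denoising loss E||eps - eps_theta(x_t)||^2 at one
    noise level, with a linear stand-in eps_theta(x) = x @ W for the
    transformer denoiser (illustration only)."""
    x0 = sample_x0(n)
    eps = rng.standard_normal((n, d))
    x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1 - alpha_bar) * eps
    resid = x_t @ W - eps
    loss = np.mean(np.sum(resid ** 2, axis=1))
    grad = 2.0 * x_t.T @ resid / n            # gradient w.r.t. W
    return loss, grad

# Gradient descent on the denoising objective: the loss decreases
# toward the best risk achievable by this model class.
W = np.zeros((d, d))
alpha_bar = 0.5
losses = []
for _ in range(200):
    loss, grad = ddpm_loss_and_grad(W, alpha_bar)
    losses.append(loss)
    W -= 0.01 * grad
```

The paper's contribution is to establish such convergence, to the Bayes-optimal risk rather than merely the best risk of a restricted model class, for an actual transformer trained on the population version of this objective.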