Transformers Learn the Optimal DDPM Denoiser for Multi-Token GMMs
arXiv cs.LG / 4/14/2026
Key Points
- The paper introduces the first convergence analysis for training transformer-based diffusion (DDPM) models, addressing why they can achieve effective score matching despite non-convex training dynamics.
- It studies a population DDPM objective with data drawn from a multi-token Gaussian mixture distribution and derives theoretical requirements on the number of tokens per datapoint and the number of training iterations needed to reach global convergence (a generic form of such an objective is sketched after this list).
- The analysis quantifies convergence of the denoising risk to the Bayes-optimal risk, linking training progress to a score matching error (the standard noise-score identity is recalled after this list).
- The authors find that a key role of the transformer’s self-attention is to realize a mean-denoising mechanism approximating the oracle MMSE estimator of the diffusion-step noise; a toy version of that estimator is sketched in code after this list.
- Numerical experiments are reported that validate the theoretical convergence claims and the proposed interpretation of self-attention’s role in denoising.
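
For orientation, the population noise-prediction (DDPM) objective being analyzed has the following generic form; this is the standard DDPM parameterization, not necessarily the paper's exact one, with the data distribution $p_0$ specialized to a multi-token Gaussian mixture:

$$
\mathcal{L}(\theta) \;=\; \mathbb{E}_{t}\,\mathbb{E}_{x_0 \sim p_0}\,\mathbb{E}_{\epsilon \sim \mathcal{N}(0,I)}\Big[\big\|\epsilon_\theta\big(\sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,\; t\big) - \epsilon\big\|^2\Big],
$$

whose pointwise minimizer is the conditional mean $\epsilon^\star(x_t, t) = \mathbb{E}[\epsilon \mid x_t]$, i.e. the Bayes-optimal (MMSE) noise predictor.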
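
The link between denoising risk and score matching comes from the standard identity relating the optimal noise predictor to the score of the noised marginal $p_t$, together with the usual decomposition of squared error around the conditional mean (standard facts, not claims specific to this paper):

$$
\nabla_x \log p_t(x) \;=\; -\,\frac{\mathbb{E}[\epsilon \mid x_t = x]}{\sqrt{1-\bar\alpha_t}},
\qquad
\mathcal{L}(\theta) - \mathcal{L}^\star \;=\; \mathbb{E}_{t}\,\mathbb{E}_{x_t}\Big[\big\|\epsilon_\theta(x_t, t) - \mathbb{E}[\epsilon \mid x_t]\big\|^2\Big].
$$

So driving the training loss down to the Bayes-optimal value $\mathcal{L}^\star$ is equivalent to driving a time-weighted score matching error to zero.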
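
To see why self-attention is a natural fit for this denoiser, here is a minimal NumPy sketch of the oracle MMSE noise predictor for a toy single-token Gaussian mixture with uniform weights and identity component covariances (simplifying assumptions for illustration, not the paper's exact multi-token setting). In this case the posterior over mixture components is a softmax of inner products between the noised input and the component means, so the oracle is a softmax-weighted mean, the same functional form as attention with query $x_t$ and keys/values built from the means $\mu_k$:

```python
import numpy as np

def oracle_eps(x_t, abar_t, means):
    """Oracle MMSE noise predictor E[eps | x_t] for the DDPM forward process
    x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, where
    x_0 ~ uniform mixture of N(mu_k, I).  Toy setting for illustration only.

    x_t:    (d,) noised input
    abar_t: scalar in (0, 1), cumulative signal coefficient
    means:  (K, d) mixture means mu_k
    """
    a = np.sqrt(abar_t)
    b = np.sqrt(1.0 - abar_t)
    # With identity component covariance, x_t | component k ~ N(a * mu_k, I),
    # so the component posterior is a softmax over inner products with the
    # means -- the attention-like "mean denoising" weights.
    logits = a * (means @ x_t) - 0.5 * a**2 * np.sum(means**2, axis=1)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    # Posterior mean of x_0:  E[x_0 | x_t] = a * x_t + (1 - a^2) * (w @ means),
    # and E[eps | x_t] = (x_t - a * E[x_0 | x_t]) / b simplifies (1 - a^2 = b^2) to:
    return b * (x_t - a * (w @ means))
```

Two sanity checks: as $\bar\alpha_t \to 0$ the predictor returns $x_t$ itself (the input is pure noise), and as $\bar\alpha_t \to 1$ the softmax weights concentrate on the component nearest to $x_t$ when the means are well separated. The bulleted claim is that trained self-attention implements this kind of softmax-weighted mean over learned analogues of the $\mu_k$.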