When Models Don't Collapse: On the Consistency of Iterative MLE

arXiv stat.ML / March 27, 2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper investigates “model collapse” in iterative generative modeling where each new model is trained on a mixture of real data and synthetic data produced by earlier model generations.
  • It provides theoretical, non-asymptotic guarantees that under standard MLE assumptions, collapse can be avoided even when the fraction of real data decreases to (effectively) zero.
  • The authors also show that additional assumptions beyond basic MLE consistency are necessary, because removing them can allow collapse to occur arbitrarily quickly despite the original real data remaining in the training set.
  • The authors argue these are the first rigorous examples of iterative generative modeling with accumulating data that nonetheless leads rapidly to model collapse, explicitly characterizing when collapse can and cannot be prevented.

Abstract

The widespread use of generative models has created a feedback loop, in which each generation of models is trained on data partially produced by its predecessors. This process has raised concerns about model collapse: A critical degradation in performance caused by repeated training on synthetic data. However, different analyses in the literature have reached different conclusions as to the severity of model collapse. As such, it remains unclear how concerning this phenomenon is, and under which assumptions it can be avoided. To address this, we theoretically study model collapse for maximum likelihood estimation (MLE), in a natural setting where synthetic data is gradually added to the original data set. Under standard assumptions (similar to those long used for proving asymptotic consistency and normality of MLE), we establish non-asymptotic bounds showing that collapse can be avoided even as the fraction of real data vanishes. On the other hand, we prove that some assumptions (beyond MLE consistency) are indeed necessary: Without them, model collapse can occur arbitrarily quickly, even when the original data is still present in the training set. To the best of our knowledge, these are the first rigorous examples of iterative generative modeling with accumulating data that rapidly leads to model collapse.
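The accumulating-data setting in the abstract can be illustrated with a toy simulation. This is a minimal sketch, not the paper's formal setup: it uses a one-dimensional Gaussian with unknown mean (where the MLE is just the sample mean), and the specific constants (`N_REAL`, `M_SYNTH`, `GENERATIONS`) are arbitrary choices for illustration. Each generation refits the MLE on the pooled data set, then samples fresh synthetic data from the fitted model and adds it to the pool, so the fraction of real data shrinks toward zero over generations.

```python
import random

random.seed(0)

TRUE_MEAN = 2.0
N_REAL = 500       # original real samples (illustrative choice)
M_SYNTH = 500      # synthetic samples added per generation (illustrative)
GENERATIONS = 20

# Original real data drawn from N(TRUE_MEAN, 1).
data = [random.gauss(TRUE_MEAN, 1.0) for _ in range(N_REAL)]

estimates = []
for g in range(GENERATIONS):
    # MLE for a Gaussian mean with known variance: the sample mean
    # of the pooled (real + accumulated synthetic) data set.
    theta_hat = sum(data) / len(data)
    estimates.append(theta_hat)
    # Sample synthetic data from the fitted model and accumulate it;
    # by the last generation only N_REAL / len(data) of the pool is real.
    data.extend(random.gauss(theta_hat, 1.0) for _ in range(M_SYNTH))

print(f"real-data fraction at the end: {N_REAL / len(data):.3f}")
print(f"first estimate: {estimates[0]:.3f}")
print(f"last estimate:  {estimates[-1]:.3f}")
```

In this benign setting the estimate stays close to the true mean across generations rather than drifting away, which matches the paper's positive message that, under standard MLE assumptions, accumulating synthetic data need not cause collapse; the paper's negative results show that outside such assumptions the analogous loop can degrade arbitrarily quickly.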