Language Generation with Replay: A Learning-Theoretic View of Model Collapse
arXiv cs.LG / 3/13/2026
Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The paper introduces a replay adversary that augments the training data stream with the model's own past outputs, giving a learning-theoretic framework for studying model collapse.
- It provides a fine-grained characterization: replay is harmless for uniform generation, but creates separations for non-uniform generation and for generation in the limit.
- The findings connect to practical mitigation strategies such as data cleaning, watermarking, and output filtering, clarifying when these heuristics may fail.
- Overall, the work offers theoretical insight into the limits of current data-contamination mitigation approaches for training large language models.
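To build intuition for why feeding a model its own outputs back into training can degrade it, here is a toy simulation (a hypothetical illustration, not the paper's formal replay-adversary construction): each generation refits an empirical categorical distribution to finite samples drawn from the previous generation's model. Rare symbols that fail to appear in a sample are lost permanently, so the support can only shrink over generations.

```python
import random
from collections import Counter

def self_replay_generations(vocab_size=20, sample_size=50, generations=10, seed=1):
    """Toy model-collapse cartoon under pure self-replay (illustrative only).

    Each generation draws `sample_size` samples from the previous model's
    distribution and refits an empirical categorical distribution to them.
    Symbols that miss the sample vanish for good, so support is non-increasing.
    Returns the support size after each generation.
    """
    rng = random.Random(seed)
    support = list(range(vocab_size))
    weights = [1.0] * vocab_size  # generation 0: uniform over the vocabulary
    support_sizes = []
    for _ in range(generations):
        samples = rng.choices(support, weights=weights, k=sample_size)
        counts = Counter(samples)
        support = sorted(counts)                # symbols still observed
        weights = [counts[s] for s in support]  # empirical refit on own outputs
        support_sizes.append(len(support))
    return support_sizes

print(self_replay_generations())  # support size per generation, non-increasing
```

This extreme case (no fresh data at all) contrasts with the paper's setting, where replayed outputs merely augment a stream of genuine data; the point of the toy is only to show the rich-get-richer dynamics that make replay a natural adversary to analyze.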
