Drift and selection in LLM text ecosystems

arXiv cs.AI / 4/13/2026

Key Points

  • The paper studies how public text ecosystems evolve when model-generated outputs are repeatedly re-ingested and learned from by later agents, creating a recursive feedback loop.
  • It introduces an exactly solvable mathematical framework using variable-order n-gram agents to separate two mechanisms: drift (loss of rare forms from unfiltered reuse) and selection (filtering caused by publication, ranking, and verification).
  • The authors characterize stable corpus distributions in the infinite-corpus limit, showing that unfiltered reuse drives convergence toward a shallow state in which additional lookahead brings no further benefit (a toy simulation of this collapse appears after this list).
  • When selection is normative (favoring quality, correctness, or novelty), the system maintains richer structure, and the paper derives an optimal upper bound on how far the resulting dynamics can diverge from shallow equilibria.
  • The framework offers guidance for designing AI training corpora by identifying the conditions under which recursive publication compresses text diversity and the conditions under which selective filtering preserves structure.
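
To make the drift mechanism concrete, the following toy simulation runs the unfiltered reuse loop with a unigram agent. This is an illustrative sketch only: the paper's agents are variable-order n-grams and its results are exact infinite-corpus characterizations, while the vocabulary size, corpus size, and fixed-order model here are arbitrary choices for the demo.

```python
# Toy illustration of drift under unfiltered reuse: a unigram agent is fit
# to the corpus, a same-size corpus is resampled from it, and the cycle
# repeats. This is a hedged sketch, not the paper's variable-order n-gram
# construction or its exact infinite-corpus analysis.
import random
from collections import Counter

random.seed(0)

# Initial corpus: a Zipf-like long tail over 500 token types.
vocab = [f"w{i}" for i in range(500)]
weights = [1.0 / (i + 1) for i in range(500)]
corpus = random.choices(vocab, weights=weights, k=2000)

for gen in range(10):
    counts = Counter(corpus)                  # "train" the agent on the corpus
    types, freqs = zip(*counts.items())
    # Unfiltered reuse: the next public corpus is sampled from the agent.
    corpus = random.choices(types, weights=freqs, k=2000)
    print(f"generation {gen}: {len(set(corpus))} surviving types")
```

Each generation the number of surviving token types falls, because a type that happens to draw zero samples can never return: the qualitative signature of drift trimming the tail.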

Abstract

The public text record, the material from which both people and AI systems now learn, is increasingly shaped by its own outputs. Generated text enters the public record, later agents learn from it, and the cycle repeats. Here we develop an exactly solvable mathematical framework for this recursive process, based on variable-order n-gram agents, and separate two forces acting on the public corpus. The first is drift: unfiltered reuse progressively removes rare forms, and in the infinite-corpus limit we characterise the stable distributions exactly. The second is selection: publication, ranking and verification filter what enters the record, and the outcome depends on what is selected. When publication merely reflects the statistical status quo, the corpus converges to a shallow state in which further lookahead brings no benefit. When publication is normative, rewarding quality, correctness or novelty, deeper structure persists, and we establish an optimal upper bound on the resulting divergence from shallow equilibria. The framework therefore identifies when recursive publication compresses public text and when selective filtering sustains richer structure, with implications for the design of AI training corpora.
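
The selection mechanism can be sketched the same way by adding a publication filter to the loop above. The acceptance rule here (publish a token with probability count^(-1/2), so common forms are published less often) is a hypothetical novelty-favoring stand-in invented for illustration; the paper's normative selection operators and its divergence bound are defined analytically, not by this rejection scheme.

```python
# Hedged sketch of selection layered on the same toy loop: a hypothetical
# publication filter that favors novelty by accepting common tokens less
# often. The rule is invented for illustration and is not the paper's
# normative selection operator.
import random
from collections import Counter

random.seed(0)
vocab = [f"w{i}" for i in range(500)]
weights = [1.0 / (i + 1) for i in range(500)]
corpus = random.choices(vocab, weights=weights, k=2000)

for gen in range(10):
    counts = Counter(corpus)
    types, freqs = zip(*counts.items())
    published = []
    while len(published) < 2000:
        tok = random.choices(types, weights=freqs, k=1)[0]
        # Normative filter: the more common a token, the less often it is
        # published, tilting published mass back toward rare forms.
        if random.random() < counts[tok] ** -0.5:
            published.append(tok)
    corpus = published
    print(f"generation {gen}: {len(set(corpus))} surviving types")
```

Relative to the unfiltered loop, substantially more types survive each generation, since acceptance proportional to count^(-1/2) flattens the head of the distribution and protects the tail: the qualitative effect the abstract attributes to normative selection.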