
Data Darwinism Part II: DataEvolve -- AI can Autonomously Evolve Pretraining Data Curation

arXiv cs.AI / 3/17/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • DataEvolve automates the evolution of data-curation strategies via iterative optimization, using category-specific loops and pools of experiences and strategies.
  • It was applied to 8 categories within a 672B-token Nemotron-CC corpus, producing Darwin-CC (504B tokens) after 30 iterations per category.
  • Training 3B models on 500B tokens with Darwin-CC yielded +3.96 points over raw data and a 44.13 average across 18 benchmarks, with notable gains on knowledge-intensive tasks like MMLU.
  • The evolved strategies converge on cleaning-focused approaches—noise removal and format normalization with domain-aware preservation—aligning with Generative Refinement principles from Part I.
  • Ablation studies show iterative evolution is essential: optimized strategies outperform suboptimal ones by 2.93 points, demonstrating the feasibility of evolutionary strategy design at pretraining scale.

Abstract

Data Darwinism (Part I) established a ten-level hierarchy for data processing, showing that stronger processing can unlock greater data value. However, that work relied on manually designed strategies for a single category. Modern pretraining corpora comprise hundreds of heterogeneous categories spanning domains and content types, each demanding specialized treatment. At this scale, manual strategy design becomes prohibitive. This raises a key question: can strategies evolve in an automated way? We introduce DataEvolve, a framework that enables strategies to evolve through iterative optimization rather than manual design. For each data category, DataEvolve operates in a closed evolutionary loop: it identifies quality issues, generates candidate strategies, executes them on sampled data, evaluates results, and refines approaches across generations. The process accumulates knowledge through an experience pool of discovered issues and a strategy pool tracking performance across iterations. Applied to 8 categories spanning 672B tokens from Nemotron-CC, DataEvolve produces Darwin-CC, a 504B-token dataset with strategies evolved through 30 iterations per category. Training 3B models on 500B tokens, Darwin-CC outperforms raw data (+3.96 points) and achieves a 44.13 average score across 18 benchmarks, surpassing DCLM, Ultra-FineWeb, and FineWeb-Edu, with strong gains on knowledge-intensive tasks such as MMLU. Analysis shows evolved strategies converge on cleaning-focused approaches: targeted noise removal and format normalization with domain-aware preservation, echoing the L4 (Generative Refinement) principles from Part I. Ablation studies confirm iterative evolution is essential: optimized strategies outperform suboptimal ones by 2.93 points, establishing evolutionary strategy design as feasible and necessary for pretraining-scale data curation.
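The closed loop the abstract describes (identify issues, generate candidate strategies, execute on sampled data, evaluate, refine, while accumulating an experience pool and a strategy pool) can be sketched in a few lines. The sketch below is illustrative only: the strategy names, the toy scoring function, and the refinement rule are stand-ins, not the paper's implementation, which uses LLM-generated strategies and downstream benchmark evaluation.

```python
import random
import re

# Toy cleaning strategies standing in for LLM-generated candidates
# (names here are illustrative, not from the paper).
STRATEGIES = {
    "strip_html": lambda t: re.sub(r"<[^>]+>", "", t),
    "normalize_ws": lambda t: re.sub(r"\s+", " ", t).strip(),
    "drop_boilerplate": lambda t: re.sub(r"(?i)click here.*", "", t).strip(),
}

def apply_pipeline(pipeline, text):
    """Apply an ordered list of strategy names to one document."""
    for name in pipeline:
        text = STRATEGIES[name](text)
    return text

def evaluate(texts):
    """Toy quality score: fraction of documents free of HTML tags,
    doubled whitespace, and boilerplate phrases. The real framework
    evaluates against much richer quality signals."""
    def clean(t):
        return not (re.search(r"<[^>]+>", t) or "  " in t
                    or re.search(r"(?i)click here", t))
    return sum(clean(t) for t in texts) / len(texts)

def evolve_strategy(category_data, n_iterations=30, sample_size=8, seed=0):
    """One per-category evolutionary loop: sample data, extend the
    best-so-far pipeline with a candidate strategy, score the result,
    and track history in experience/strategy pools."""
    rng = random.Random(seed)
    experience_pool = []  # candidates that failed to help, kept as "issues seen"
    strategy_pool = []    # (pipeline, score) records across iterations
    best = ([], evaluate(category_data))  # raw-data baseline
    for _ in range(n_iterations):
        sample = rng.sample(category_data, min(sample_size, len(category_data)))
        # "Generate" a candidate by extending the current best pipeline,
        # mimicking refinement across generations.
        extra = rng.choice(list(STRATEGIES))
        pipeline = best[0] + [extra] if extra not in best[0] else best[0]
        score = evaluate([apply_pipeline(pipeline, t) for t in sample])
        strategy_pool.append((pipeline, score))
        if score > best[1]:
            best = (pipeline, score)
        elif extra not in best[0]:
            experience_pool.append(extra)  # record the unhelpful candidate
    return best
```

In this toy setting the loop converges on a cleaning-focused pipeline (tag stripping, whitespace normalization, boilerplate removal), which mirrors the paper's finding that evolved strategies settle on targeted noise removal and format normalization.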