Tracing the Roots: A Multi-Agent Framework for Uncovering Data Lineage in Post-Training LLMs

arXiv cs.AI / 4/14/2026

Key Points

  • The paper introduces “data lineage” for post-training LLM datasets and proposes an automated multi-agent framework to reconstruct how datasets evolve and relate to each other over time.
  • Large-scale lineage analysis reveals domain-specific structural patterns, including vertical refinement in math-focused data and horizontal aggregation in general-domain corpora.
  • The authors identify systemic issues such as structural redundancy caused by implicit dataset overlap and the propagation of benchmark contamination along lineage paths.
  • Using the reconstructed lineage graph, they build a “lineage-aware diversity-oriented” dataset by anchoring instruction sampling to upstream root sources to reduce downstream homogenization and hidden redundancy.
  • The work argues that lineage-centric analysis is a scalable, robust topological alternative to sample-level dataset comparisons for managing large post-training data ecosystems.
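
To make the graph view in the points above concrete, here is a minimal sketch of a lineage graph stored as upstream-to-downstream edges. It shows how root sources can be identified and how a contamination flag on one dataset would reach everything derived from it. The dataset names, the edge list, and the helper functions are all hypothetical; the paper's framework reconstructs such edges automatically with agents, which this sketch does not attempt.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges (upstream, downstream); names are illustrative only.
EDGES = [
    ("root_math", "math_refined_v1"),
    ("math_refined_v1", "math_refined_v2"),   # vertical refinement chain
    ("root_chat", "general_mix"),             # horizontal aggregation
    ("root_code", "general_mix"),
    ("math_refined_v2", "general_mix"),
]

def build_children(edges):
    """Map each dataset to the set of datasets directly derived from it."""
    children = defaultdict(set)
    for upstream, downstream in edges:
        children[upstream].add(downstream)
    return children

def find_roots(edges):
    """Roots are datasets that appear as a source but never as a derivative."""
    upstreams = {u for u, _ in edges}
    downstreams = {d for _, d in edges}
    return upstreams - downstreams

def downstream_of(start, children):
    """Breadth-first walk: every dataset reachable from `start` along lineage edges."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in children.get(node, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

children = build_children(EDGES)
roots = find_roots(EDGES)
# If math_refined_v1 were found to contain benchmark items, these datasets
# would be candidates for inherited contamination.
at_risk = downstream_of("math_refined_v1", children)
```

A topological query like `downstream_of` only touches dataset-level edges, which is why a lineage view can be cheaper than comparing datasets sample by sample.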

Abstract

Post-training data plays a pivotal role in shaping the capabilities of Large Language Models (LLMs), yet datasets are often treated as isolated artifacts, overlooking the systemic connections that underlie their evolution. To disentangle these complex relationships, we introduce the concept of “data lineage” to the LLM ecosystem and propose an automated multi-agent framework to reconstruct the evolutionary graph of dataset development. Through large-scale lineage analysis, we characterize domain-specific structural patterns, such as vertical refinement in math-oriented datasets and horizontal aggregation in general-domain corpora. Moreover, we uncover pervasive systemic issues, including structural redundancy induced by implicit dataset intersections and the propagation of benchmark contamination along lineage paths. To demonstrate the practical value of lineage analysis for data construction, we leverage the reconstructed lineage graph to create a lineage-aware diversity-oriented dataset. By anchoring instruction sampling at upstream root sources, this approach mitigates downstream homogenization and hidden redundancy, yielding a more diverse post-training corpus. We further highlight lineage-centric analysis as an efficient and robust topological alternative to sample-level dataset comparison for large-scale data ecosystems. By grounding data construction in explicit lineage structures, our work advances post-training data curation toward a more systematic and controllable paradigm.
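
The idea of anchoring instruction sampling at upstream root sources can be illustrated with a small sketch. The abstract does not specify the sampling procedure, so the version below rests on two assumptions: each instruction has already been attributed to a root source, and every root receives the same quota. The function name `sample_by_root` and the example data are hypothetical.

```python
import random
from collections import defaultdict

def sample_by_root(samples, per_root, seed=0):
    """Draw up to `per_root` distinct instructions from each root source.

    `samples` is a list of (instruction, root_source) pairs. Equal quotas per
    root are an assumption of this sketch, not the paper's stated method.
    """
    rng = random.Random(seed)
    by_root = defaultdict(list)
    for text, root in samples:
        by_root[root].append(text)

    picked = []
    for root in sorted(by_root):
        # Drop exact duplicates that arrived via different downstream datasets.
        pool = list(dict.fromkeys(by_root[root]))
        k = min(per_root, len(pool))
        picked.extend((text, root) for text in rng.sample(pool, k))
    return picked

# Hypothetical instructions attributed to two roots; one appears twice
# because two derived datasets both inherited it.
SAMPLES = [
    ("Solve 2x + 3 = 7", "root_math"),
    ("Solve 2x + 3 = 7", "root_math"),
    ("Prove that 17 is prime", "root_math"),
    ("Integrate x^2", "root_math"),
    ("Summarize this email", "root_chat"),
]

subset = sample_by_root(SAMPLES, per_root=2)
```

Sampling per root rather than per downstream dataset keeps a heavily re-derived source from dominating the mix simply because it shows up in many aggregated corpora.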