EarlySciRev: A Dataset of Early-Stage Scientific Revisions Extracted from LaTeX Writing Traces

arXiv cs.CL / 3/31/2026


Key Points

  • EarlySciRev is a new dataset of early-stage, paragraph-level revisions to scientific text, extracted from arXiv LaTeX sources using text that authors commented out while drafting.
  • The method aligns commented LaTeX segments with nearby final text to form candidate revision pairs, then uses LLM-based filtering to keep revisions that reflect genuine author changes.
  • From 1.28M initial candidate pairs, the pipeline retains 578k validated revision pairs (roughly 45%), grounding the dataset in authentic early drafting behavior rather than only final or near-final paper versions.
  • The release also includes a human-annotated benchmark for revision detection, aiming to support empirical study of revision dynamics and evaluation of LLMs for scientific writing.
  • The authors position EarlySciRev as complementary to existing datasets focused on late-stage revisions or synthetic rewrites, enabling research on revision modeling and LLM-assisted editing workflows.
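
The extraction step described in the key points can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names `split_blocks` and `candidate_pairs`, the `min_words` threshold, and the rule "pair each commented-out block with the text block that immediately follows it" are all assumptions made here for illustration.

```python
def split_blocks(latex_source):
    """Group consecutive lines into ("comment" | "text", content) blocks.

    Assumption: a line counts as commented-out only if it starts with '%'.
    Blank lines end the current block.
    """
    blocks = []
    kind, buf = None, []
    for line in latex_source.splitlines():
        stripped = line.strip()
        if not stripped:
            # A blank line closes whatever block is open.
            if buf:
                blocks.append((kind, " ".join(buf)))
            kind, buf = None, []
            continue
        line_kind = "comment" if stripped.startswith("%") else "text"
        if line_kind == "comment":
            stripped = stripped.lstrip("%").strip()
            if not stripped:
                continue  # a bare '%' line carries no text
        if kind is not None and line_kind != kind:
            # Switching between comment and text starts a new block.
            blocks.append((kind, " ".join(buf)))
            buf = []
        kind = line_kind
        buf.append(stripped)
    if buf:
        blocks.append((kind, " ".join(buf)))
    return blocks


def candidate_pairs(latex_source, min_words=5):
    """Pair each commented-out block with the text block right after it.

    min_words is a hypothetical threshold that skips short notes such as TODOs.
    """
    blocks = split_blocks(latex_source)
    pairs = []
    for i, (kind, content) in enumerate(blocks[:-1]):
        next_kind, next_content = blocks[i + 1]
        if (kind == "comment" and next_kind == "text"
                and len(content.split()) >= min_words):
            pairs.append((content, next_content))
    return pairs


sample = r"""
% We propose a novel method that is very good for parsing.
We propose a parsing method that improves accuracy on two benchmarks.

% TODO cite
Related work is discussed below.
"""
pairs = candidate_pairs(sample)
```

In this sample, the commented-out sentence is paired with the paragraph that follows it, while the short `TODO` note is skipped. A real pipeline would also need to handle inline comments, escaped `\%`, and commented-out markup, none of which this sketch attempts.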

Abstract

Scientific writing is an iterative process that generates rich revision traces, yet publicly available resources typically expose only final or near-final versions of papers. This limits empirical study of revision behaviour and evaluation of large language models (LLMs) for scientific writing. We introduce EarlySciRev, a dataset of early-stage scientific text revisions automatically extracted from arXiv LaTeX source files. Our key observation is that commented-out text in LaTeX often preserves discarded or alternative formulations written by the authors themselves. By aligning commented segments with nearby final text, we extract paragraph-level candidate revision pairs and apply LLM-based filtering to retain genuine revisions. Starting from 1.28M candidate pairs, our pipeline yields 578k validated revision pairs, grounded in authentic early drafting traces. We additionally provide a human-annotated benchmark for revision detection. EarlySciRev complements existing resources focused on late-stage revisions or synthetic rewrites and supports research on scientific writing dynamics, revision modelling, and LLM-assisted editing.
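
To make the filtering idea concrete, here is a minimal sketch of a cheap lexical check on a candidate pair. It is a stand-in heuristic, not the LLM-based filtering the abstract describes: the function names and the `low` and `high` thresholds are assumptions chosen for illustration. It uses word-level similarity from Python's standard `difflib` module to discard pairs that are identical (nothing was revised) or unrelated (the comment is probably not an earlier draft of the final text).

```python
from difflib import SequenceMatcher


def similarity(draft, final):
    """Word-level similarity ratio between two passages, in [0, 1]."""
    return SequenceMatcher(None, draft.split(), final.split()).ratio()


def looks_like_revision(draft, final, low=0.2, high=0.95):
    """Keep pairs that overlap partially.

    low and high are hypothetical thresholds: below `low` the passages
    share too little to be versions of each other; above `high` they are
    effectively the same text.
    """
    score = similarity(draft, final)
    return low <= score <= high


draft = "We propose a novel method that is very good for parsing."
final = "We propose a parsing method that improves accuracy on two benchmarks."

keep = looks_like_revision(draft, final)           # partial overlap
same = looks_like_revision(final, final)           # identical text
unrelated = looks_like_revision("alpha beta gamma", "delta epsilon zeta")
```

A lexical check like this cannot judge whether a change is a genuine author revision, which is presumably why the authors rely on LLM-based filtering; it could at most serve as an inexpensive pre-filter ahead of such a step.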
