EarlySciRev: A Dataset of Early-Stage Scientific Revisions Extracted from LaTeX Writing Traces
arXiv cs.CL / 3/31/2026
Key Points
- EarlySciRev is a new dataset that extracts early-stage paragraph-level scientific text revisions from arXiv LaTeX sources using authors’ commented-out drafting traces.
- The method aligns commented LaTeX segments with nearby final text to form candidate revision pairs, then uses LLM-based filtering to keep revisions that reflect genuine author changes.
- From 1.28M initial candidate pairs, the pipeline validates 578k revision pairs (about 45%), grounding the data in authentic early writing behavior rather than only final or near-final paper versions.
- The release also includes a human-annotated benchmark for revision detection, aiming to support empirical study of revision dynamics and evaluation of LLMs for scientific writing.
- The authors position EarlySciRev as complementary to existing datasets focused on late-stage revisions or synthetic rewrites, enabling research on revision modeling and LLM-assisted editing workflows.
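
The candidate-pair step described in the key points can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names (`split_blocks`, `candidate_pairs`), the block window, and the similarity thresholds are all hypothetical, and a word-level `difflib` ratio stands in for whatever alignment the paper actually uses. The LLM-based filtering stage is not shown.

```python
import difflib
import re

# Assumption: a line counts as "commented out" only when its first
# non-space character is '%'. Trailing inline comments are ignored.
COMMENT_LINE = re.compile(r"^\s*%+\s?(.*)$")


def split_blocks(latex_source):
    """Group consecutive lines into ("comment" | "final", text) blocks.

    Blank lines and a switch between commented and uncommented lines
    both end the current block.
    """
    blocks = []
    kind, buf = None, []

    def flush():
        nonlocal kind, buf
        if buf:
            blocks.append((kind, " ".join(buf)))
        kind, buf = None, []

    for line in latex_source.splitlines():
        if not line.strip():
            flush()
            continue
        match = COMMENT_LINE.match(line)
        line_kind = "comment" if match else "final"
        text = match.group(1).strip() if match else line.strip()
        if line_kind != kind:
            flush()
            kind = line_kind
        if text:
            buf.append(text)
    flush()
    return blocks


def candidate_pairs(latex_source, window=2, min_ratio=0.3, max_ratio=0.95):
    """Pair each commented block with the most similar nearby final block.

    The thresholds are illustrative: min_ratio drops unrelated notes such
    as TODOs, max_ratio drops near-verbatim copies with no real revision.
    """
    blocks = split_blocks(latex_source)
    pairs = []
    for i, (kind, draft) in enumerate(blocks):
        if kind != "comment":
            continue
        best = None
        lo, hi = max(0, i - window), min(len(blocks), i + window + 1)
        for j in range(lo, hi):
            other_kind, final = blocks[j]
            if other_kind != "final":
                continue
            ratio = difflib.SequenceMatcher(
                None, draft.split(), final.split()
            ).ratio()
            if min_ratio <= ratio <= max_ratio and (
                best is None or ratio > best[2]
            ):
                best = (draft, final, ratio)
        if best:
            pairs.append(best)
    return pairs


sample = r"""
% We propose a new method for parsing that is fast.
We propose a novel parsing method that is both fast and accurate.

% TODO: check numbers
Results are shown in Table 1.
"""

for draft, final, ratio in candidate_pairs(sample):
    print(f"{ratio:.2f} | {draft} -> {final}")
```

In this sketch the commented-out draft sentence is paired with its rewritten neighbor, while the TODO note shares no words with nearby text and is dropped. A lexical threshold like this is only a coarse pre-filter; judging whether a pair reflects a genuine author change is the role the summary assigns to the LLM-based filtering.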



