EarlySciRev: A Dataset of Early-Stage Scientific Revisions Extracted from LaTeX Writing Traces

arXiv cs.CL / 3/31/2026

Key Points

EarlySciRev is a new dataset that extracts early-stage paragraph-level scientific text revisions from arXiv LaTeX sources using authors’ commented-out drafting traces.
The method aligns commented LaTeX segments with nearby final text to form candidate revision pairs, then uses LLM-based filtering to keep revisions that reflect genuine author changes.
From 1.28M initial candidate pairs, the pipeline retains 578k validated revision pairs, so the data reflects authentic early drafting behavior rather than only final or near-final paper versions.
The release also includes a human-annotated benchmark for revision detection, aiming to support empirical study of revision dynamics and evaluation of LLMs for scientific writing.
The authors position EarlySciRev as complementary to existing datasets focused on late-stage revisions or synthetic rewrites, enabling research on revision modeling and LLM-assisted editing workflows.

Abstract

Scientific writing is an iterative process that generates rich revision traces, yet publicly available resources typically expose only final or near-final versions of papers. This limits empirical study of revision behaviour and evaluation of large language models (LLMs) for scientific writing. We introduce EarlySciRev, a dataset of early-stage scientific text revisions automatically extracted from arXiv LaTeX source files. Our key observation is that commented-out text in LaTeX often preserves discarded or alternative formulations written by the authors themselves. By aligning commented segments with nearby final text, we extract paragraph-level candidate revision pairs and apply LLM-based filtering to retain genuine revisions. Starting from 1.28M candidate pairs, our pipeline yields 578k validated revision pairs, grounded in authentic early drafting traces. We additionally provide a human-annotated benchmark for revision detection. EarlySciRev complements existing resources focused on late-stage revisions or synthetic rewrites and supports research on scientific writing dynamics, revision modelling, and LLM-assisted editing.
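To make the extraction idea concrete, here is a minimal sketch of how commented-out LaTeX blocks might be paired with adjacent final text. It is an illustration under stated assumptions, not the authors' pipeline: the function names (`split_blocks`, `candidate_pairs`), the similarity thresholds, and the use of `difflib.SequenceMatcher` as the alignment measure are all hypothetical choices made for this example. It only handles whole-line comments, and the LLM-based filtering step described in the abstract is not implemented.

```python
import re
from difflib import SequenceMatcher

# Matches lines that are entirely commented out, e.g. "% old sentence".
# Escaped percent signs (\%) and trailing inline comments are not handled.
COMMENT_LINE = re.compile(r"^\s*%+\s?(.*)$")


def split_blocks(latex_source):
    """Split source into ordered ("comment" | "text", content) blocks.

    A block is a run of consecutive lines of the same kind; a blank
    line or a change of kind ends the current block.
    """
    blocks = []
    kind, buf = None, []

    def flush():
        nonlocal kind, buf
        if buf:
            blocks.append((kind, " ".join(buf)))
        kind, buf = None, []

    for line in latex_source.splitlines():
        if not line.strip():
            flush()
            continue
        m = COMMENT_LINE.match(line)
        line_kind = "comment" if m else "text"
        content = m.group(1).strip() if m else line.strip()
        if line_kind != kind:
            flush()
            kind = line_kind
        if content:
            buf.append(content)
    flush()
    return blocks


def candidate_pairs(latex_source, min_sim=0.3, max_sim=0.99):
    """Pair each commented block with its most similar adjacent text block.

    The thresholds are assumptions for this sketch: pairs below min_sim
    are treated as unrelated (e.g. TODO notes), pairs above max_sim as
    near-identical copies with no real revision.
    """
    blocks = split_blocks(latex_source)
    pairs = []
    for i, (kind, draft) in enumerate(blocks):
        if kind != "comment":
            continue
        best = None
        for j in (i - 1, i + 1):  # only look at direct neighbours
            if 0 <= j < len(blocks) and blocks[j][0] == "text":
                final = blocks[j][1]
                sim = SequenceMatcher(None, draft, final).ratio()
                if min_sim <= sim <= max_sim and (best is None or sim > best[2]):
                    best = (draft, final, sim)
        if best:
            pairs.append(best)
    return pairs


# Hypothetical LaTeX fragment used only to exercise the sketch.
SAMPLE = r"""
% Our method is very good and beats all the baselines by a lot.
Our method outperforms all baselines by a clear margin.

% TODO: add figure
\section{Results}
"""

for draft, final, sim in candidate_pairs(SAMPLE):
    print(f"{sim:.2f} | draft: {draft} | final: {final}")
```

On the sample fragment, the commented sentence is paired with the rewritten sentence that follows it, while the TODO note falls below the similarity threshold and is dropped. In a fuller pipeline, the surviving candidates would then go to a filtering stage such as the LLM-based one the abstract describes.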