EditPropBench: Measuring Factual Edit Propagation in Scientific Manuscripts

arXiv cs.CL / 5/5/2026

Key Points

  • The paper introduces EditPropBench, a benchmark designed to evaluate how well LLM-based editors propagate a factual change through dependent claims in scientific manuscripts.
  • Each benchmark item pairs a synthetic, ML/NLP-style manuscript with a targeted edit and a fact graph annotated at the sentence level, distinguishing direct targets, required downstream updates, and protected unrelated text (see the sketch after this list).
  • Across the benchmark’s more difficult implicit/free-form settings, results for five LLM editing systems vary substantially (ERA 0.148–0.705), and even the best system fails to capture about 30% of required cascade updates.
  • Stress tests and metric analyses suggest LLM editors can outperform deterministic substitution baselines when easier, substitution-solvable cases are included, but reliable revision still needs cascade-aware verification.
  • An audit of recent arXiv cs.CL benchmark and dataset papers finds fact-dependent qualitative claims in 37.2% of them, underscoring the practical need for tools that handle non-local edit implications.
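
To make the annotation scheme concrete, here is a minimal sketch of what one benchmark item might look like as a data structure. The field names (`manuscript`, `edit`, `required_updates`, etc.) and the toy instance are illustrative assumptions, not the paper's released schema.

```python
from dataclasses import dataclass, field


@dataclass
class EditPropItem:
    """One hypothetical EditPropBench item (field names are illustrative, not the released schema)."""
    manuscript: list[str]   # sentences of the synthetic ML/NLP-style manuscript
    edit: dict              # targeted factual change, e.g. {"old": "215", "new": "80"}
    direct_targets: set[int] = field(default_factory=set)    # sentences that literally state the edited fact
    required_updates: set[int] = field(default_factory=set)  # dependent sentences that must be revised (the "ripple")
    protected: set[int] = field(default_factory=set)         # unrelated sentences that must stay untouched


# Toy instance mirroring the dataset-size example from the abstract.
item = EditPropItem(
    manuscript=[
        "Our corpus contains 215 documents.",       # direct target
        "This makes it a medium-scale resource.",   # dependent qualitative claim
        "All annotators were paid a fair wage.",    # unrelated, protected
    ],
    edit={"old": "215", "new": "80"},
    direct_targets={0},
    required_updates={1},
    protected={2},
)
```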

Abstract

Local factual edits in scientific manuscripts often create non-local revision obligations. If a dataset changes from 215 to 80 documents, claims such as "medium-scale" or "a few hundred items" may also become stale, even though they do not repeat the edited number. We introduce EditPropBench, a benchmark for measuring whether LLM editors propagate factual edits through dependent manuscript claims. Each item contains an ML/NLP-style synthetic manuscript, a targeted edit, and a controlled fact graph with sentence-level labels for direct targets, required downstream updates, and protected unrelated text. EditPropBench provides a controlled manuscript-level benchmark with sentence-level dependency supervision, three editing protocols, adversarial metric probes, stress-test variants, and a metric suite centered on Edit-Ripple Adherence (ERA). On the hard implicit/free-form stratum, five LLM editing systems span ERA 0.148–0.705; even the strongest misses roughly 30% of required cascade updates. A mixed-stratum stress test shows that LLMs retain a positive advantage over deterministic substitution baselines when easy substitution-solvable cases are included. Finally, an audit of recent arXiv cs.CL benchmark and dataset papers finds fact-dependent qualitative claims in 37.2% of papers. EditPropBench shows that current LLM editors can repair many implicit consequences of factual edits, but reliable scientific revision still requires cascade-aware checking.
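
The summary does not spell out how Edit-Ripple Adherence (ERA) is computed, so the sketch below should be read as one plausible shape for such a metric rather than the paper's definition: it rewards covering the required downstream updates while penalizing edits to protected sentences. The function name, the components, and the equal weighting are all assumptions.

```python
def edit_ripple_score(changed: set[int],
                      required_updates: set[int],
                      protected: set[int]) -> float:
    """Illustrative ripple-adherence score (NOT the paper's ERA definition).

    `changed` holds indices of sentences the editing system actually modified.
    The score combines recall over required downstream updates with
    preservation of protected, unrelated sentences.
    """
    if not required_updates:
        return 1.0 if not (changed & protected) else 0.0
    cascade_recall = len(changed & required_updates) / len(required_updates)
    protection = 1.0 - (len(changed & protected) / len(protected) if protected else 0.0)
    # Equal-weight average of the two components; the real metric may weight
    # or combine them differently.
    return 0.5 * (cascade_recall + protection)


# Example: the system fixes the edited number (sentence 0) but misses the
# dependent "medium-scale" claim (sentence 1) and leaves sentence 2 alone.
print(edit_ripple_score(changed={0}, required_updates={1}, protected={2}))  # 0.5
```

Under a reading like this, a deterministic substitution baseline scores well on direct targets but, by construction, near zero on implicit claims such as "medium-scale", which is exactly the regime the hard implicit/free-form stratum isolates.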