EditPropBench: Measuring Factual Edit Propagation in Scientific Manuscripts

arXiv cs.CL / 5/5/2026

Key Points

  • The paper introduces EditPropBench, a benchmark designed to evaluate how well LLM-based editors propagate a factual change through dependent claims in scientific manuscripts.
  • Each benchmark item pairs a synthetic, ML/NLP-style manuscript with a targeted edit and a fact graph annotated at the sentence level, distinguishing direct targets, required downstream updates, and protected unrelated text (see the sketch after this list).
  • Across the benchmark’s more difficult implicit/free-form settings, results for five LLM editing systems vary substantially (ERA 0.148–0.705), and even the best system fails to capture about 30% of required cascade updates.
  • Stress tests and metric analyses suggest LLM editors can outperform deterministic substitution baselines when easier, substitution-solvable cases are included, but reliable revision still needs cascade-aware verification.
  • An audit of recent arXiv cs.CL benchmark and dataset papers finds fact-dependent qualitative claims in 37.2% of them, underscoring the practical need for tools that handle non-local edit implications.
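
To make the annotation scheme concrete, here is a minimal sketch of what one benchmark item might look like as a data structure. The field names (`manuscript`, `edit`, `required_updates`, etc.) and the toy instance are illustrative assumptions, not the paper's released schema.

```python
from dataclasses import dataclass, field


@dataclass
class EditPropItem:
    """One hypothetical EditPropBench item (field names are illustrative, not the released schema)."""
    manuscript: list[str]   # sentences of the synthetic ML/NLP-style manuscript
    edit: dict              # targeted factual change, e.g. {"old": "215", "new": "80"}
    direct_targets: set[int] = field(default_factory=set)    # sentences that literally state the edited fact
    required_updates: set[int] = field(default_factory=set)  # dependent sentences that must be revised (the "ripple")
    protected: set[int] = field(default_factory=set)         # unrelated sentences that must stay untouched


# Toy instance mirroring the dataset-size example from the abstract.
item = EditPropItem(
    manuscript=[
        "Our corpus contains 215 documents.",       # direct target
        "This makes it a medium-scale resource.",   # dependent qualitative claim
        "All annotators were paid a fair wage.",    # unrelated, protected
    ],
    edit={"old": "215", "new": "80"},
    direct_targets={0},
    required_updates={1},
    protected={2},
)
```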

Abstract

Local factual edits in scientific manuscripts often create non-local revision obligations. If a dataset changes from 215 to 80 documents, claims such as "medium-scale" or "a few hundred items" may also become stale, even though they do not repeat the edited number. We introduce EditPropBench, a benchmark for measuring whether LLM editors propagate factual edits through dependent manuscript claims. Each item contains an ML/NLP-style synthetic manuscript, a targeted edit, and a controlled fact graph with sentence-level labels for direct targets, required downstream updates, and protected unrelated text. EditPropBench provides a controlled manuscript-level benchmark with sentence-level dependency supervision, three editing protocols, adversarial metric probes, stress-test variants, and a metric suite centered on Edit-Ripple Adherence (ERA). On the hard implicit/free-form stratum, five LLM editing systems span ERA 0.148–0.705; even the strongest misses roughly 30% of required cascade updates. A mixed-stratum stress test shows that LLMs retain a positive advantage over deterministic substitution baselines when easy substitution-solvable cases are included. Finally, an audit of recent arXiv cs.CL benchmark and dataset papers finds fact-dependent qualitative claims in 37.2% of papers. EditPropBench shows that current LLM editors can repair many implicit consequences of factual edits, but reliable scientific revision still requires cascade-aware checking.
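
The summary does not spell out how Edit-Ripple Adherence (ERA) is computed, so the sketch below should be read as one plausible shape for such a metric rather than the paper's definition: it rewards covering the required downstream updates while penalizing edits to protected sentences. The function name, the components, and the equal weighting are all assumptions.

```python
def edit_ripple_score(changed: set[int],
                      required_updates: set[int],
                      protected: set[int]) -> float:
    """Illustrative ripple-adherence score (NOT the paper's ERA definition).

    `changed` holds indices of sentences the editing system actually modified.
    The score combines recall over required downstream updates with
    preservation of protected, unrelated sentences.
    """
    if not required_updates:
        return 1.0 if not (changed & protected) else 0.0
    cascade_recall = len(changed & required_updates) / len(required_updates)
    protection = 1.0 - (len(changed & protected) / len(protected) if protected else 0.0)
    # Equal-weight average of the two components; the real metric may weight
    # or combine them differently.
    return 0.5 * (cascade_recall + protection)


# Example: the system fixes the edited number (sentence 0) but misses the
# dependent "medium-scale" claim (sentence 1) and leaves sentence 2 alone.
print(edit_ripple_score(changed={0}, required_updates={1}, protected={2}))  # 0.5
```

Under a reading like this, a deterministic substitution baseline scores well on direct targets but, by construction, near zero on implicit claims such as "medium-scale", which is exactly the regime the hard implicit/free-form stratum isolates.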