Rewrite the News: Tracing Editorial Reuse Across News Agencies

arXiv cs.CL / 4/1/2026

Key Points

  • The paper studies sentence-level cross-lingual text reuse in multilingual journalism by detecting reused sentences without requiring full translations.
  • Using weak supervision and publication timestamps, it traces the earliest likely foreign source for each reused English sentence across 15 foreign agencies in seven languages.
  • An analysis of 1,037 Slovenian Press Agency (STA) articles and 237,551 foreign agency (FA) articles finds reuse in 52% of STA articles and in 1.6% of FA articles.
  • The study shows that editorial reuse is mostly non-literal, often involving paraphrase and compositional reuse, and that reused material is more common in the middle and end of articles than in leads.
  • The authors release a dataset and code for automated pre-selection to reduce information overload in journalistic workflows.
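
The timestamp step in the key points can be illustrated with a minimal sketch. It is not the authors' released code: the `Candidate` fields, the `min_score` threshold, and the rule that a source must be published strictly before the STA article are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Candidate:
    """One candidate alignment between an STA sentence and a foreign-agency sentence.

    All field names are hypothetical; the paper's actual data schema may differ.
    """
    sta_sentence_id: str
    fa_sentence_id: str
    sta_published: datetime
    fa_published: datetime
    score: float  # similarity from some cross-lingual matcher (assumed)


def earliest_sources(candidates, min_score=0.8):
    """Keep, for each STA sentence, the earliest-published plausible FA source.

    Assumptions: a candidate is plausible if its score is at least `min_score`
    and the FA article was published strictly before the STA article.
    """
    best = {}
    for cand in candidates:
        if cand.score < min_score:
            continue  # too dissimilar to count as reuse
        if cand.fa_published >= cand.sta_published:
            continue  # a source cannot appear after the text that reuses it
        current = best.get(cand.sta_sentence_id)
        if current is None or cand.fa_published < current.fa_published:
            best[cand.sta_sentence_id] = cand
    return best
```

Called on a list of candidates, `earliest_sources` returns a dictionary from STA sentence ID to the single retained candidate, so each reused sentence ends up with at most one source.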

Abstract

This paper investigates sentence-level text reuse in multilingual journalism, analyzing where reused content occurs within articles. We present a weakly supervised method for detecting sentence-level cross-lingual reuse without requiring full translations, designed to support automated pre-selection to reduce information overload for journalists (Holyst et al., 2024). The study compares English-language articles from the Slovenian Press Agency (STA) with reports from 15 foreign agencies (FA) in seven languages, using publication timestamps to retain the earliest likely foreign source for each reused sentence. We analyze 1,037 STA and 237,551 FA articles from two time windows (October 7-November 2, 2023; February 1-28, 2025) and identify 1,087 aligned sentence pairs after filtering to the earliest sources. Reuse occurs in 52% of STA articles and 1.6% of FA articles and is predominantly non-literal, involving paraphrase and compositional reuse from multiple sources. Reused content tends to appear in the middle and end of English articles, while leads are more often original, indicating that simple lexical matching overlooks substantial editorial reuse. Compared with prior work focused on monolingual overlap, we (i) detect reuse across languages without requiring full translation, (ii) use publication timing to identify likely sources, and (iii) analyze where reused material is situated within articles. Dataset and code: https://github.com/kunturs/lrec2026-rewrite-news.
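
The positional finding (reuse concentrated in the middle and end, leads more often original) can be tallied along these lines. This is a rough sketch under stated assumptions: the abstract does not say how lead, middle, and end are defined, so the rule used here (lead is the first sentence, end is roughly the last third, middle is everything else) and both function names are hypothetical.

```python
from collections import Counter


def position_bucket(index, n_sentences):
    """Map a zero-based sentence index to 'lead', 'middle', or 'end'.

    Assumed definition, not taken from the paper: the lead is the first
    sentence, the end is the last third of the article (at least one
    sentence), and the middle is whatever lies between.
    """
    if n_sentences <= 0 or not 0 <= index < n_sentences:
        raise ValueError("index must lie within the article")
    if index == 0:
        return "lead"
    end_start = n_sentences - max(1, n_sentences // 3)
    return "end" if index >= end_start else "middle"


def reuse_position_counts(articles):
    """Count reused sentences per position bucket.

    `articles` is an iterable of (n_sentences, reused_indices) pairs, where
    reused_indices lists the zero-based positions of reused sentences.
    """
    counts = Counter()
    for n_sentences, reused_indices in articles:
        for index in reused_indices:
            counts[position_bucket(index, n_sentences)] += 1
    return counts
```

Dividing each bucket's count by the total number of sentences in that bucket would give a per-position reuse rate, which is a fairer comparison than raw counts because the middle bucket usually contains the most sentences.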