RECAP: Reproducing Copyrighted Data from LLM Training with an Agentic Pipeline
arXiv cs.CL / 3/13/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- RECAP proposes an agentic pipeline to elicit and verify memorized training data from LLM outputs, aiming to reveal what a model has seen.
- It uses a feedback-driven loop where an initial extraction is evaluated by a secondary language model against a reference passage, producing minimal correction hints to guide subsequent generations.
- To address alignment-induced refusals, RECAP includes a jailbreaking module that detects and overcomes such barriers.
- The authors evaluate RECAP on EchoTrace, a benchmark spanning over 30 full books, reporting a substantial gain in extraction quality (ROUGE-L from 0.38 to 0.47 with GPT-4.1, about a 24% increase).
- The work raises important implications for data provenance, copyright, and model governance, highlighting both auditing opportunities and security risks in LLM training data.