RECAP: Reproducing Copyrighted Data from LLMs Training with an Agentic Pipeline
arXiv cs.CL / March 13, 2026
Key Points
- RECAP proposes an agentic pipeline to elicit and verify memorized training data from LLM outputs, aiming to reveal what a model has seen.
- It uses a feedback-driven loop in which an initial extraction attempt is scored by a secondary judge model against a reference passage; the judge emits minimal correction hints, rather than the passage itself, to steer subsequent generations.
- To address alignment-induced refusals, RECAP includes a jailbreaking module that detects such refusals and bypasses them.
- The authors evaluate RECAP on EchoTrace, a benchmark spanning over 30 full books, and report a substantial gain in extraction quality (ROUGE-L rising from 0.38 to 0.47 with GPT-4.1, a roughly 24% relative increase).
- The work raises important implications for data provenance, copyright, and model governance, highlighting both auditing opportunities and security risks in LLM training data.
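The feedback loop described above can be sketched in a few lines. This is a minimal illustration, not RECAP's actual implementation: the names (`extraction_loop`, `judge_hint`, `toy_generator`) are hypothetical, and the LLM judge is replaced by a simple string-diff stub that reveals only one diverging token per round, mimicking the "minimal correction hint" idea. ROUGE-L is computed here as an LCS-based F1 over whitespace tokens.

```python
def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1: longest common subsequence over whitespace tokens."""
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    # Rolling-row dynamic program for LCS length.
    dp = [0] * (len(ref) + 1)
    for tok in cand:
        prev = 0
        for j, rtok in enumerate(ref, 1):
            cur = dp[j]
            dp[j] = prev + 1 if tok == rtok else max(dp[j], dp[j - 1])
            prev = cur
    lcs = dp[-1]
    if lcs == 0:
        return 0.0
    p, r = lcs / len(cand), lcs / len(ref)
    return 2 * p * r / (p + r)

def judge_hint(candidate: str, reference: str) -> str:
    """Stub judge: return a minimal hint (one diverging token), not the
    full reference passage. Empty string means a verbatim match."""
    cand, ref = candidate.split(), reference.split()
    for i, tok in enumerate(ref):
        if i >= len(cand) or cand[i] != tok:
            return f"token {i} should be '{tok}'"
    return ""

def extraction_loop(generate, reference: str, rounds: int = 10) -> str:
    """Feedback-driven loop: generate -> judge -> hint -> regenerate,
    keeping the highest-scoring extraction seen so far."""
    hint, best, best_score = "", "", 0.0
    for _ in range(rounds):
        candidate = generate(hint)
        score = rouge_l_f1(candidate, reference)
        if score > best_score:
            best, best_score = candidate, score
        hint = judge_hint(candidate, reference)
        if not hint:  # verbatim reproduction reached
            break
    return best

def toy_generator(reference: str):
    """Toy stand-in for the target LLM: starts with placeholder tokens
    and applies each hint it receives (purely for demonstration)."""
    guess = ["???"] * len(reference.split())
    def generate(hint: str) -> str:
        if hint:
            idx = int(hint.split()[1])
            guess[idx] = hint.split("'")[1]
        return " ".join(guess)
    return generate
```

In the paper's setting the `generate` callable would wrap the target model (optionally behind the jailbreaking module) and `judge_hint` would be the secondary language model; the loop structure itself is what this sketch demonstrates.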