Read the Paper, Write the Code: Agentic Reproduction of Social-Science Results

arXiv cs.AI / 4/27/2026

📰 News · Models & Research

Key Points

  • The paper explores whether LLM-based agents can reproduce social-science findings using only a paper’s methods description and the original data; the agents never see the original code, the published results, or any part of the paper beyond the extracted methods.
  • It introduces an agentic reproduction system that converts methods text into structured instructions, runs reimplementations under strict information isolation, and performs deterministic, cell-level comparisons between reproduced outputs and the published results.
  • The system includes an error-attribution step that traces discrepancies across the agent’s pipeline to identify likely root causes of reproduction failures.
  • Experiments across four agent scaffolds and four LLMs on 48 human-verified reproducible papers show that agents can often recover published results, but success rates vary widely by model, scaffold, and paper.
  • Root-cause analysis indicates that failures arise from both agent-specific mistakes and from missing or ambiguous details (underspecification) in the papers’ methods descriptions.
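
The deterministic, cell-level comparison mentioned above can be sketched as a small function. This is a hypothetical illustration, not the paper's actual implementation: the real system's table format, matching rules, and tolerances are not described here, so the dict-of-cells representation and the relative tolerance are assumptions.

```python
import math

def compare_tables(reproduced, published, rel_tol=1e-2):
    """Deterministically compare two result tables cell by cell.

    Each table is a dict mapping (row_label, col_label) -> numeric value
    (an assumed representation). Returns a per-cell report plus an overall
    match rate, so a failure can be traced to specific cells rather than
    judged only at the whole-table level.
    """
    report = {}
    for cell, expected in published.items():
        got = reproduced.get(cell)
        if got is None:
            report[cell] = ("missing", None)          # cell absent from the reproduction
        elif math.isclose(got, expected, rel_tol=rel_tol):
            report[cell] = ("match", got)             # within the numeric tolerance
        else:
            report[cell] = ("mismatch", got)          # a candidate for error attribution
    matched = sum(1 for status, _ in report.values() if status == "match")
    rate = matched / len(published) if published else 1.0
    return report, rate

# Example: one coefficient reproduced within tolerance, one standard error off.
published = {("x", "beta"): 0.52, ("x", "se"): 0.10}
reproduced = {("x", "beta"): 0.521, ("x", "se"): 0.30}
report, rate = compare_tables(reproduced, published)
```

Because the comparison is a pure function of the two tables, rerunning it always yields the same verdict, which is what makes the evaluation deterministic and the mismatched cells a stable starting point for root-cause analysis.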

Abstract

Recent work has used LLM agents to reproduce empirical social science results with access to both the data and code. We broaden this scope by asking: Can they reproduce results given only a paper's methods description and original data? We develop an agentic reproduction system that extracts structured methods descriptions from papers, runs reimplementations under strict information isolation -- agents never see the original code, results, or paper -- and enables deterministic, cell-level comparison of reproduced outputs to the original results. An error attribution step traces discrepancies through the system chain to identify root causes. Evaluating four agent scaffolds and four LLMs on 48 papers with human-verified reproducibility, we find that agents can largely recover published results, but performance varies substantially between models, scaffolds, and papers. Root cause analysis reveals that failures stem both from agent errors and from underspecification in the papers themselves.