Read the Paper, Write the Code: Agentic Reproduction of Social-Science Results

arXiv cs.AI / 4/27/2026

📰 News · Models & Research

Key Points

  • The paper explores whether LLM-based agents can reproduce social-science findings using only a paper’s methods description and the original data; the agents never see the original code, the published results, or any part of the paper beyond the extracted methods.
  • It introduces an agentic reproduction system that converts methods text into structured instructions, runs reimplementations under strict information isolation, and performs deterministic, cell-level comparisons between reproduced outputs and the published results.
  • The system includes an error-attribution step that traces discrepancies across the agent’s pipeline to identify likely root causes of reproduction failures.
  • Experiments across four agent scaffolds and four LLMs on 48 human-verified reproducible papers show that agents can often recover published results, but success rates vary widely by model, scaffold, and paper.
  • Root-cause analysis indicates that failures arise from both agent-specific mistakes and from missing or ambiguous details (underspecification) in the papers’ methods descriptions.
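
The deterministic, cell-level comparison mentioned above can be sketched as a small function. This is a hypothetical illustration, not the paper's actual implementation: the real system's table format, matching rules, and tolerances are not described here, so the dict-of-cells representation and the relative tolerance are assumptions.

```python
import math

def compare_tables(reproduced, published, rel_tol=1e-2):
    """Deterministically compare two result tables cell by cell.

    Each table is a dict mapping (row_label, col_label) -> numeric value
    (an assumed representation). Returns a per-cell report plus an overall
    match rate, so a failure can be traced to specific cells rather than
    judged only at the whole-table level.
    """
    report = {}
    for cell, expected in published.items():
        got = reproduced.get(cell)
        if got is None:
            report[cell] = ("missing", None)          # cell absent from the reproduction
        elif math.isclose(got, expected, rel_tol=rel_tol):
            report[cell] = ("match", got)             # within the numeric tolerance
        else:
            report[cell] = ("mismatch", got)          # a candidate for error attribution
    matched = sum(1 for status, _ in report.values() if status == "match")
    rate = matched / len(published) if published else 1.0
    return report, rate

# Example: one coefficient reproduced within tolerance, one standard error off.
published = {("x", "beta"): 0.52, ("x", "se"): 0.10}
reproduced = {("x", "beta"): 0.521, ("x", "se"): 0.30}
report, rate = compare_tables(reproduced, published)
```

Because the comparison is a pure function of the two tables, rerunning it always yields the same verdict, which is what makes the evaluation deterministic and the mismatched cells a stable starting point for root-cause analysis.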

Abstract

Recent work has used LLM agents to reproduce empirical social science results with access to both the data and code. We broaden this scope by asking: Can they reproduce results given only a paper's methods description and original data? We develop an agentic reproduction system that extracts structured methods descriptions from papers, runs reimplementations under strict information isolation -- agents never see the original code, results, or paper -- and enables deterministic, cell-level comparison of reproduced outputs to the original results. An error attribution step traces discrepancies through the system chain to identify root causes. Evaluating four agent scaffolds and four LLMs on 48 papers with human-verified reproducibility, we find that agents can largely recover published results, but performance varies substantially between models, scaffolds, and papers. Root cause analysis reveals that failures stem both from agent errors and from underspecification in the papers themselves.