Can Coding Agents Reproduce Findings in Computational Materials Science?
arXiv cs.CL / 5/4/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The paper introduces AutoMat, a benchmark designed to test whether LLM-based coding agents can reproduce scientific claims in computational materials science, going beyond coding performance alone.
- AutoMat evaluates three linked capabilities: reconstructing underspecified procedures from limited information, using specialized toolchains, and judging whether the produced evidence actually supports a claim (a sketch of how such an evaluation might be scored follows this list).
- Using claims curated from real materials science papers and testing multiple coding-agent setups across foundation models, the study finds overall reproduction success is low.
- The best-performing configuration reaches only a 54.1% success rate, with failures most common when workflows must be reconstructed from paper text and when agents deviate from or incompletely follow required methods.
- The authors position AutoMat as both a reproducibility benchmark for AI-for-science and a diagnostic tool to identify current weaknesses of agentic systems in scientific workflows.
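To make the three-capability framing concrete, here is a minimal Python sketch of what a claim-reproduction record and an all-or-nothing success rate could look like. All names here (`ClaimTask`, `ReproductionResult`, `success_rate`) and the conjunctive scoring rule are illustrative assumptions for this summary, not AutoMat's actual schema or metric.

```python
# Hypothetical sketch of a claim-reproduction benchmark record and scoring
# loop. The class names, fields, and all-three-must-hold success rule are
# assumptions for illustration, not AutoMat's published design.
from dataclasses import dataclass


@dataclass
class ClaimTask:
    claim: str                 # quantitative claim extracted from a paper
    paper_context: str         # method text the agent must reconstruct a workflow from
    required_tools: list[str]  # specialized toolchain the workflow is expected to use


@dataclass
class ReproductionResult:
    task: ClaimTask
    workflow_reconstructed: bool  # did the agent build a complete procedure from the text?
    followed_method: bool         # did it use the required toolchain without deviating?
    verdict_correct: bool         # did its support/refute judgment match ground truth?


def success_rate(results: list[ReproductionResult]) -> float:
    """Count a run as a success only if all three linked capabilities hold."""
    if not results:
        return 0.0
    wins = sum(
        r.workflow_reconstructed and r.followed_method and r.verdict_correct
        for r in results
    )
    return wins / len(results)


if __name__ == "__main__":
    # Toy demo with made-up tasks; real claims and tools would come from papers.
    demo = [
        ReproductionResult(
            ClaimTask("band gap of X is 1.1 eV", "...", ["dft-code"]),
            workflow_reconstructed=True, followed_method=True, verdict_correct=True,
        ),
        ReproductionResult(
            ClaimTask("bulk modulus of Y is 160 GPa", "...", ["md-code"]),
            workflow_reconstructed=True, followed_method=False, verdict_correct=False,
        ),
    ]
    print(f"success rate: {success_rate(demo):.1%}")  # -> 50.0%
```

Under this reading, the 54.1% headline number would mean that fewer than half of runs fail at least one of the three stages, which is consistent with the paper's observation that method deviation and workflow reconstruction are the dominant failure modes.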