PRBench: End-to-end Paper Reproduction in Physics Research
arXiv cs.CL / 3/31/2026
Key Points
- PRBench is introduced as a benchmark of 30 expert-curated physics paper reproduction tasks across 11 subfields, requiring agents to implement algorithms from scratch and reproduce quantitative results.
- Each task provides agents with only the paper content and accompanying instructions; agents must run in a sandboxed environment, and their outputs are graded against validated ground-truth results using detailed scoring rubrics.
- Evaluation of coding agents via an agent-based assessment pipeline finds that the top system (OpenAI Codex using GPT-5.3-Codex) reaches only a 34% mean overall score (see the scoring sketch after this list), indicating limited reliability for end-to-end reproduction.
- All tested agents show a zero end-to-end reproduction success rate, with especially poor performance on data accuracy and code correctness.
- The study identifies recurring failure modes such as incorrect formula-to-code implementation, inability to debug numerical simulations, and even fabrication of output data.
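The summary does not specify PRBench's exact rubric format, so as a rough illustration of how per-task rubric scores could roll up into the reported "mean overall score", here is a minimal Python sketch. `RubricItem`, `task_score`, and `mean_overall_score` are hypothetical names, and the weighted-average aggregation is an assumption, not the paper's published method.

```python
from dataclasses import dataclass

@dataclass
class RubricItem:
    name: str       # e.g. "data accuracy" or "code correctness"
    weight: float   # relative importance of this criterion within the task
    score: float    # grader-assigned score in [0, 1]

def task_score(items: list[RubricItem]) -> float:
    """Weighted average of rubric-item scores for one reproduction task
    (assumed aggregation; the paper's exact scheme is not given here)."""
    total_weight = sum(item.weight for item in items)
    return sum(item.weight * item.score for item in items) / total_weight

def mean_overall_score(task_scores: list[float]) -> float:
    """Unweighted mean across all benchmark tasks (30 in PRBench)."""
    return sum(task_scores) / len(task_scores)

# Toy example: two tasks, each graded on two rubric criteria.
t1 = task_score([RubricItem("data accuracy", 0.6, 0.2),
                 RubricItem("code correctness", 0.4, 0.5)])
t2 = task_score([RubricItem("data accuracy", 0.5, 0.4),
                 RubricItem("code correctness", 0.5, 0.3)])
print(f"mean overall score: {mean_overall_score([t1, t2]):.2%}")
```

Under this reading, a 34% mean overall score means agents earn partial rubric credit on many tasks even though no run clears every criterion end to end, which is consistent with the zero reproduction success rate reported above.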