AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery
arXiv cs.AI / 4/29/2026
Key Points
- AutoResearchBench is introduced as a new benchmark specifically for evaluating AI agents’ ability to discover relevant scientific literature autonomously.
- The benchmark includes two task types—Deep Research (progressively locating a target paper) and Wide Research (collecting a comprehensive set of papers that meet given conditions).
- The authors argue AutoResearchBench is distinct from prior agentic web-browsing benchmarks because it is research- and concept-focused, requires fine-grained use of detailed information, and is open-ended, with an unknown number of valid papers per query.
- Results show even strong LLM-based agents perform poorly: the best systems reach roughly 9.39% accuracy on Deep Research and 9.31% IoU (intersection over union) on Wide Research, with many baselines scoring under 5%, underscoring the difficulty of both tasks.
- The dataset, evaluation pipeline, and code are publicly released to support future research on autonomous scientific discovery.
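The Wide Research score cited above is an IoU over sets of papers. A minimal sketch of how such a set-level IoU could be computed, assuming papers are identified by unique IDs (the function name and inputs here are illustrative, not taken from the paper's released code):

```python
def set_iou(predicted_ids: set[str], gold_ids: set[str]) -> float:
    """Intersection-over-union between a predicted and a gold set of paper IDs.

    Returns 1.0 for a perfect match, 0.0 when the sets are disjoint.
    """
    if not predicted_ids and not gold_ids:
        return 1.0  # both empty: treat as a perfect match
    intersection = predicted_ids & gold_ids
    union = predicted_ids | gold_ids
    return len(intersection) / len(union)


# Example: agent retrieves 3 papers, 2 of which are in the 4-paper gold set.
predicted = {"2404.00001", "2404.00002", "2404.00003"}
gold = {"2404.00001", "2404.00002", "2404.00009", "2404.00010"}
score = set_iou(predicted, gold)  # 2 shared / 5 total = 0.4
```

Under this metric, an agent is penalized both for missing relevant papers and for including irrelevant ones, which matches the open-ended nature of the Wide Research task.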