AI scientists produce results without reasoning scientifically
arXiv cs.AI / 4/22/2026
Key Points
- The paper evaluates LLM-based scientific agents in eight domains using 25,000+ runs, analyzing both performance and the epistemic structure of their reasoning.
- Results show that the base language model dominates agent behavior and outcomes, explaining 41.4% of the variance, while the agent scaffold contributes far less (1.5%); a sketch of this kind of variance decomposition follows the list.
- In 68% of reasoning traces the agents ignore evidence, and in only 26% do they revise beliefs when evidence refutes them; convergent evidence from multiple tests is rare.
- The same unreliable reasoning pattern appears across different modes (workflow execution vs. hypothesis-driven inquiry) and persists even when agents are given successful reasoning trajectories as context.
- The authors conclude that outcome-based evaluation and scaffold engineering alone cannot ensure scientifically justified results; reasoning quality itself must become a training target.
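
To make the variance figures concrete, here is a minimal sketch of one way an explained-variance (eta-squared) decomposition over agent runs could be computed. The column names, models, scaffolds, and scores below are hypothetical illustrations, not data or methods from the paper.

```python
# Hedged sketch: estimate how much of the outcome variance each factor explains,
# in the spirit of the reported 41.4% (base model) vs 1.5% (scaffold) split.
# All identifiers and values here are made up for illustration.
import pandas as pd

def eta_squared(df: pd.DataFrame, factor: str, outcome: str) -> float:
    """Between-group sum of squares divided by total sum of squares (eta^2)."""
    grand_mean = df[outcome].mean()
    ss_total = ((df[outcome] - grand_mean) ** 2).sum()
    group_means = df.groupby(factor)[outcome].transform("mean")
    ss_between = ((group_means - grand_mean) ** 2).sum()
    return ss_between / ss_total

# Hypothetical run log: one row per agent run.
runs = pd.DataFrame({
    "base_model": ["model-a", "model-a", "model-b", "model-b", "model-c", "model-c"],
    "scaffold":   ["react",   "plan",    "react",   "plan",    "react",   "plan"],
    "score":      [0.82,      0.80,      0.65,      0.61,      0.40,      0.43],
})

for factor in ("base_model", "scaffold"):
    print(f"{factor}: eta^2 = {eta_squared(runs, factor, 'score'):.3f}")
```

On data like this, grouping by base model accounts for most of the score variance while grouping by scaffold accounts for little, which is the qualitative pattern the paper reports at much larger scale.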