AI scientists produce results without reasoning scientifically

arXiv cs.AI / 4/22/2026

Key Points

  • The paper evaluates LLM-based scientific agents in eight domains using 25,000+ runs, analyzing both performance and the epistemic structure of their reasoning.
  • Results show the base language model dominates agent behavior and outcomes (41.4% explained variance) while the agent scaffold contributes far less (1.5%); a sketch of this kind of variance decomposition follows this list.
  • In 68% of reasoning traces, the agents ignore evidence, and only in 26% do they revise beliefs based on refutation; convergent multi-test evidence is rare.
  • The same unreliable reasoning pattern appears across different modes (workflow execution vs. hypothesis-driven inquiry) and persists even when agents are given successful reasoning trajectories as context.
  • The authors conclude that outcome-based evaluation and scaffold engineering alone cannot ensure scientifically justified results; reasoning quality itself must become a training target.
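
The 41.4% vs. 1.5% split comes from attributing run-level score variance to the base model and the agent scaffold. The paper's exact method and data layout are not given here, so the following is only a minimal sketch of one standard way to compute such shares: a two-way ANOVA with eta-squared effect sizes. The model/scaffold names and scores are invented for illustration.

```python
# Hypothetical sketch: split score variance into model vs. scaffold shares
# via eta-squared from a two-way ANOVA. Column names, factor levels, and
# scores are illustrative, not the paper's actual data.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

runs = pd.DataFrame({
    "model":    ["gpt", "gpt", "claude", "claude"] * 3,
    "scaffold": ["react", "plan", "react", "plan"] * 3,
    "score":    [0.62, 0.58, 0.71, 0.69, 0.60, 0.55,
                 0.73, 0.70, 0.61, 0.57, 0.72, 0.68],
})

# Fit score ~ model + scaffold, then read off each factor's variance share.
fit = ols("score ~ C(model) + C(scaffold)", data=runs).fit()
anova = sm.stats.anova_lm(fit, typ=2)
eta_sq = anova["sum_sq"] / anova["sum_sq"].sum()
print(eta_sq)  # fraction of variance: model, scaffold, residual
```

Eta-squared is simply each factor's sum of squares over the total; on a large, balanced grid of runs it approximates per-factor explained-variance shares of the kind the paper reports.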

Abstract

Large language model (LLM)-based systems are increasingly deployed to conduct scientific research autonomously, yet whether their reasoning adheres to the epistemic norms that make scientific inquiry self-correcting is poorly understood. Here, we evaluate LLM-based scientific agents across eight domains, ranging from workflow execution to hypothesis-driven inquiry, through more than 25,000 agent runs and two complementary lenses: (i) a systematic performance analysis that decomposes the contributions of the base model and the agent scaffold, and (ii) a behavioral analysis of the epistemological structure of agent reasoning. We observe that the base model is the primary determinant of both performance and behavior, accounting for 41.4% of explained variance versus 1.5% for the scaffold. Across all configurations, evidence is ignored in 68% of traces, refutation-driven belief revision occurs in 26%, and convergent multi-test evidence is rare. The same reasoning pattern appears whether the agent executes a computational workflow or conducts hypothesis-driven inquiry. These patterns persist even when agents receive near-complete successful reasoning trajectories as context, and the resulting unreliability compounds across repeated trials in epistemically demanding domains. Thus, current LLM-based agents execute scientific workflows but do not exhibit the epistemic patterns that characterize scientific reasoning. Outcome-based evaluation cannot detect these failures, and scaffold engineering alone cannot repair them. Until reasoning itself becomes a training target, the scientific knowledge produced by such agents cannot be justified by the process that generated it.
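
The abstract's claim that unreliability "compounds across repeated trials" can be made concrete with a back-of-the-envelope calculation. A minimal sketch, under an independence assumption the paper does not state, using the reported 68% ignored-evidence rate:

```python
# Illustrative arithmetic only: if each run independently contains an
# epistemic failure (e.g., ignored evidence) with probability p, the chance
# that a multi-run pipeline stays failure-free shrinks geometrically.
# p = 0.68 is the paper's ignored-evidence rate; independence is assumed
# here purely for illustration.
p_fail = 0.68
for k in (1, 3, 5):
    p_clean = (1 - p_fail) ** k
    print(f"{k} run(s): P(no ignored evidence anywhere) = {p_clean:.3f}")
# 1 run(s): 0.320   3 run(s): 0.033   5 run(s): 0.003
```

Even under this simplistic model, a five-step inquiry is almost certain to contain at least one trace where evidence was ignored, which is the sense in which per-run unreliability compounds.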