AI scientists produce results without reasoning scientifically

arXiv cs.AI / 4/22/2026

Key Points

  • The paper evaluates LLM-based scientific agents in eight domains using 25,000+ runs, analyzing both performance and the epistemic structure of their reasoning.
  • Results show the base language model dominates agent behavior and outcomes (41.4% explained variance) while the agent scaffold contributes far less (1.5%); a sketch of this kind of variance decomposition follows this list.
  • In 68% of reasoning traces, the agents ignore evidence, and only in 26% do they revise beliefs based on refutation; convergent multi-test evidence is rare.
  • The same unreliable reasoning pattern appears across different modes (workflow execution vs. hypothesis-driven inquiry) and persists even when agents are given successful reasoning trajectories as context.
  • The authors conclude that outcome-based evaluation and scaffold engineering alone cannot ensure scientifically justified results; reasoning quality itself must become a training target.
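
The 41.4% vs. 1.5% split comes from attributing run-level score variance to the base model and the agent scaffold. The paper's exact method and data layout are not given here, so the following is only a minimal sketch of one standard way to compute such shares: a two-way ANOVA with eta-squared effect sizes. The model/scaffold names and scores are invented for illustration.

```python
# Hypothetical sketch: split score variance into model vs. scaffold shares
# via eta-squared from a two-way ANOVA. Column names, factor levels, and
# scores are illustrative, not the paper's actual data.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

runs = pd.DataFrame({
    "model":    ["gpt", "gpt", "claude", "claude"] * 3,
    "scaffold": ["react", "plan", "react", "plan"] * 3,
    "score":    [0.62, 0.58, 0.71, 0.69, 0.60, 0.55,
                 0.73, 0.70, 0.61, 0.57, 0.72, 0.68],
})

# Fit score ~ model + scaffold, then read off each factor's variance share.
fit = ols("score ~ C(model) + C(scaffold)", data=runs).fit()
anova = sm.stats.anova_lm(fit, typ=2)
eta_sq = anova["sum_sq"] / anova["sum_sq"].sum()
print(eta_sq)  # fraction of variance: model, scaffold, residual
```

Eta-squared is simply each factor's sum of squares over the total; on a large, balanced grid of runs it approximates per-factor explained-variance shares of the kind the paper reports.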

Abstract

Large language model (LLM)-based systems are increasingly deployed to conduct scientific research autonomously, yet whether their reasoning adheres to the epistemic norms that make scientific inquiry self-correcting is poorly understood. Here, we evaluate LLM-based scientific agents across eight domains, ranging from workflow execution to hypothesis-driven inquiry, through more than 25,000 agent runs and two complementary lenses: (i) a systematic performance analysis that decomposes the contributions of the base model and the agent scaffold, and (ii) a behavioral analysis of the epistemological structure of agent reasoning. We observe that the base model is the primary determinant of both performance and behavior, accounting for 41.4% of explained variance versus 1.5% for the scaffold. Across all configurations, evidence is ignored in 68% of traces, refutation-driven belief revision occurs in 26%, and convergent multi-test evidence is rare. The same reasoning pattern appears whether the agent executes a computational workflow or conducts hypothesis-driven inquiry. These patterns persist even when agents receive near-complete successful reasoning trajectories as context, and the resulting unreliability compounds across repeated trials in epistemically demanding domains. Thus, current LLM-based agents execute scientific workflows but do not exhibit the epistemic patterns that characterize scientific reasoning. Outcome-based evaluation cannot detect these failures, and scaffold engineering alone cannot repair them. Until reasoning itself becomes a training target, the scientific knowledge produced by such agents cannot be justified by the process that generated it.
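
The abstract's claim that unreliability "compounds across repeated trials" can be made concrete with a back-of-the-envelope calculation. A minimal sketch, under an independence assumption the paper does not state, using the reported 68% ignored-evidence rate:

```python
# Illustrative arithmetic only: if each run independently contains an
# epistemic failure (e.g., ignored evidence) with probability p, the chance
# that a multi-run pipeline stays failure-free shrinks geometrically.
# p = 0.68 is the paper's ignored-evidence rate; independence is assumed
# here purely for illustration.
p_fail = 0.68
for k in (1, 3, 5):
    p_clean = (1 - p_fail) ** k
    print(f"{k} run(s): P(no ignored evidence anywhere) = {p_clean:.3f}")
# 1 run(s): 0.320   3 run(s): 0.033   5 run(s): 0.003
```

Even under this simplistic model, a five-step inquiry is almost certain to contain at least one trace where evidence was ignored, which is the sense in which per-run unreliability compounds.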