Sound Agentic Science Requires Adversarial Experiments

arXiv cs.AI / 4/27/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper argues that LLM-based agents, while speeding up scientific data analysis, can also amplify a key failure mode: producing many plausible but inadequately tested claims through selectively run analyses (illustrated in the sketch after this list).
  • It emphasizes that, unlike software, scientific knowledge cannot be validated merely by iterative code accumulation or after-the-fact statistical justification.
  • The authors note that “proof” in the form of a single fluent explanation or significant result is not true verification, because the analyses that would have falsified the claim may never have been run or published.
  • They propose a falsification-first evaluation standard for agentic, non-experimental claims, requiring agents to actively look for ways the claims could fail rather than optimize for persuasive narratives.
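
To make the first point concrete, here is a minimal simulation (ours, not the paper's) of what selective analysis does to error rates: on pure-noise data, an analyst, human or agentic, who tries twenty specifications and reports only the most significant one will "discover" an effect most of the time.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def best_p_over_specs(n_specs: int = 20, n: int = 100) -> float:
    """Try n_specs analyses on pure-noise data; report only the best p-value."""
    y = rng.normal(size=n)             # outcome: pure noise, no real effect
    X = rng.normal(size=(n, n_specs))  # one candidate covariate per specification
    pvals = [stats.pearsonr(X[:, j], y)[1] for j in range(n_specs)]
    return min(pvals)                  # selective reporting keeps the "winner"

# The nominal false-positive rate is 5%, but selective reporting inflates it
# to roughly 1 - 0.95**20 ~= 0.64: each specification is another draw at
# significance, and only the best draw is published.
hits = sum(best_p_over_specs() < 0.05 for _ in range(1_000)) / 1_000
print(f"apparent 'discovery' rate on pure noise: {hits:.2f}")
```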

Abstract

LLM-based agents are rapidly being adopted for scientific data analysis, automating tasks once limited by human time and expertise. This capability is often framed as an acceleration of discovery, but it also accelerates a familiar failure mode: the rapid production of plausible, endlessly revisable analyses. Because such analyses are easy to generate, the hypothesis space is effectively turned into a pool of candidate claims, each supported by selectively chosen analyses optimized for publishable positives. Unlike software, scientific knowledge is not validated by the iterative accumulation of code and post hoc statistical support. A fluent explanation or a significant result on a single dataset is not verification, because the missing evidence is a negative space: the experiments and analyses that would have falsified the claim were never run or never published. We therefore propose that non-experimental claims produced with agentic assistance be evaluated under a falsification-first standard: agents should be used not primarily to craft the most compelling narrative, but to actively search for the ways in which the claim can fail.
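
One concrete reading of the falsification-first standard is sketched below. This is our illustration, not the paper's protocol: the probe set, function name, and thresholds are assumptions. The idea is to spend the agent's effort attacking a claimed association, reporting it only if it replicates on data the claim never saw and survives a permutation test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def falsification_probes(x: np.ndarray, y: np.ndarray, alpha: float = 0.05) -> dict:
    """Attack a claimed x-y association instead of polishing it.
    Both probes are illustrative; a real standard would require many more."""
    split = len(x) // 2
    # Probe 1: does the effect replicate on held-out data the claim never saw?
    _, p_holdout = stats.pearsonr(x[split:], y[split:])
    # Probe 2: permutation test on the discovery half -- is the observed
    # correlation distinguishable from the same analysis on shuffled labels?
    observed = abs(stats.pearsonr(x[:split], y[:split])[0])
    perm = [abs(stats.pearsonr(x[:split], rng.permutation(y[:split]))[0])
            for _ in range(999)]
    p_perm = (1 + sum(r >= observed for r in perm)) / 1000
    return {"replicates_on_holdout": p_holdout < alpha,
            "survives_permutation": p_perm < alpha}

# Example: a genuine (weak) effect should pass both probes; a spurious
# finding from selective analysis usually will not.
x = rng.normal(size=400)
y = 0.3 * x + rng.normal(size=400)
print(falsification_probes(x, y))
```

A claim that fails either probe gets flagged rather than narrated around; the design choice is that agent compute goes to refutation, not persuasion.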