AI Navigate

Nonstandard Errors in AI Agents

arXiv cs.AI / 3/18/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The study deployed 150 autonomous Claude Code agents to independently test six hypotheses about market-quality trends in NYSE TAQ data for SPY from 2015 to 2024.
  • It finds sizable nonstandard errors: agent-to-agent variation in analytical choices such as measure selection (autocorrelation versus variance ratio) and dollar versus share volume.
  • Different model families (Sonnet 4.6 versus Opus 4.6) exhibit stable "empirical styles," reflecting systematic differences in methodological preferences between families.
  • In a three-stage feedback protocol, AI peer review has minimal effect on dispersion, while exposure to top-rated exemplar papers reduces the interquartile range of estimates by 80–99% within converging measure families (see the dispersion sketch after this list).
  • Convergence occurs via within-family estimation tightening and occasional switching of measure families, but it reflects imitation rather than understanding, with implications for automated policy evaluation and empirical research.
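
To make the nonstandard-error idea concrete, here is a minimal sketch in Python of the dispersion statistic involved. It is not the paper's code: the agent estimates are simulated, and `nonstandard_error_summary` is a hypothetical helper; only the use of the interquartile range (IQR) as the cross-agent dispersion measure comes from the study.

```python
import numpy as np

def nonstandard_error_summary(estimates):
    """Summarize cross-agent dispersion in point estimates of the same
    quantity; the interquartile range (IQR) is the dispersion statistic
    tracked across feedback stages."""
    q25, q75 = np.percentile(estimates, [25, 75])
    return {"median": float(np.median(estimates)), "iqr": float(q75 - q25)}

# Hypothetical trend estimates (e.g., annual change in a liquidity
# measure) from agents in one measure family, before and after
# exposure to exemplar papers. The numbers are simulated.
rng = np.random.default_rng(0)
before = rng.normal(loc=-0.8, scale=0.50, size=75)  # wide dispersion
after = rng.normal(loc=-0.8, scale=0.03, size=75)   # tightened estimates

s0 = nonstandard_error_summary(before)
s1 = nonstandard_error_summary(after)
print(f"IQR before: {s0['iqr']:.3f}  after: {s1['iqr']:.3f}  "
      f"reduction: {100 * (1 - s1['iqr'] / s0['iqr']):.0f}%")
```

The point of tracking the IQR rather than a standard error is that the variation comes from analysts' choices, not sampling noise; the simulated "after" stage mimics the 80–99% tightening the study reports within converging measure families.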

Abstract

We study whether state-of-the-art AI coding agents, given the same data and research question, produce the same empirical results. Deploying 150 autonomous Claude Code agents to independently test six hypotheses about market-quality trends in NYSE TAQ data for SPY (2015–2024), we find that AI agents exhibit sizable *nonstandard errors* (NSEs), that is, uncertainty from agent-to-agent variation in analytical choices, analogous to those documented among human researchers. AI agents diverge substantially on measure choice (e.g., autocorrelation vs. variance ratio, dollar vs. share volume). Different model families (Sonnet 4.6 vs. Opus 4.6) exhibit stable "empirical styles," reflecting systematic differences in methodological preferences. In a three-stage feedback protocol, AI peer review (written critiques) has minimal effect on dispersion, whereas exposure to top-rated exemplar papers reduces the interquartile range of estimates by 80–99% within *converging* measure families. Convergence occurs both through within-family estimation tightening and through agents switching measure families entirely, but convergence reflects imitation rather than understanding. These findings have implications for the growing use of AI in automated policy evaluation and empirical research.
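
The measure-choice divergence in the abstract (autocorrelation versus variance ratio) is easy to see in code. The sketch below, assuming simple textbook definitions rather than whatever specifications the agents actually wrote, computes both price-efficiency measures on the same simulated return series: under a random walk the lag-1 autocorrelation is near 0 and the variance ratio near 1, so two agents picking different families report estimates on entirely different scales.

```python
import numpy as np

def first_order_autocorr(returns):
    """Lag-1 return autocorrelation; near 0 under a random walk."""
    r = returns - returns.mean()
    return float((r[1:] * r[:-1]).sum() / (r * r).sum())

def variance_ratio(returns, q=5):
    """VR(q) = Var(q-period return) / (q * Var(1-period return));
    near 1 under a random walk. Uses overlapping q-period sums and
    no small-sample bias correction."""
    r_q = np.convolve(returns, np.ones(q), mode="valid")  # rolling q-sums
    return float(r_q.var(ddof=1) / (q * returns.var(ddof=1)))

# Hypothetical minute returns: i.i.d. noise, i.e., an efficient price.
rng = np.random.default_rng(1)
r = rng.normal(0.0, 1e-4, size=50_000)
print(f"autocorr(1): {first_order_autocorr(r):+.4f}")
print(f"VR(5):       {variance_ratio(r):.4f}")
```

Because the two statistics live on different scales and respond differently to microstructure noise, dispersion across agents that choose different families cannot shrink through better estimation alone; an agent has to switch families, which is exactly the second convergence channel the paper describes.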