Nonstandard Errors in AI Agents
arXiv cs.AI / 3/18/2026
Key Points
- The study deployed 150 autonomous Claude Code agents to independently test six hypotheses about market-quality trends in NYSE TAQ data for SPY from 2015 to 2024.
- It finds sizable nonstandard errors, meaning dispersion in results that comes from agent-to-agent variation in analytical choices, such as which measure to use (autocorrelation versus variance ratio) and which units to use (dollars versus shares).
- Different model families (Sonnet 4.6 vs Opus 4.6) exhibit stable empirical styles, indicating that methodological preferences differ systematically between families.
- In a three-stage feedback protocol, AI peer review has minimal effect on dispersion, while exposure to top-rated exemplar papers reduces the interquartile range of estimates by 80-99% within converging measure families.
- Convergence occurs through tighter estimates within a measure family and occasional switching between measure families, but it reflects imitation rather than understanding. This has implications for automated policy evaluation and empirical research.
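
To make the quantities in the points above concrete, here is a minimal Python sketch of the two competing measures (lag-1 autocorrelation and a variance ratio) and of the percentage reduction in interquartile range. It is not the paper's code: the function names, the choice of horizon `q=2`, and all input numbers are assumptions made for illustration, and the data are synthetic rather than TAQ data.

```python
import random
import statistics


def lag1_autocorrelation(returns):
    """Lag-1 sample autocorrelation of a return series."""
    mean = statistics.fmean(returns)
    dev = [r - mean for r in returns]
    denom = sum(d * d for d in dev)
    num = sum(dev[i] * dev[i - 1] for i in range(1, len(dev)))
    return num / denom


def variance_ratio(returns, q=2):
    """Variance of overlapping q-period returns over q times the 1-period variance."""
    q_sums = [sum(returns[i:i + q]) for i in range(len(returns) - q + 1)]
    return statistics.pvariance(q_sums) / (q * statistics.pvariance(returns))


def iqr(values):
    """Interquartile range (third quartile minus first quartile)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


def iqr_reduction(before, after):
    """Fractional drop in IQR from `before` to `after` (0.9 means a 90% drop)."""
    return 1.0 - iqr(after) / iqr(before)


# Synthetic i.i.d. returns (assumption): both measures describe the same
# series, yet they live on different scales (around 0 versus around 1).
rng = random.Random(0)
synthetic_returns = [rng.gauss(0.0, 1e-4) for _ in range(5000)]
print("lag-1 autocorrelation:", lag1_autocorrelation(synthetic_returns))
print("variance ratio (q=2): ", variance_ratio(synthetic_returns, q=2))

# Made-up agent estimates before and after a feedback stage (assumption).
estimates_before = [-4.0, -3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0, 4.0]
estimates_after = [0.1 * x for x in estimates_before]
print("IQR reduction:", iqr_reduction(estimates_before, estimates_after))
```

Because the two measures sit on different scales, pooling estimates from agents that picked different measures inflates dispersion. That is presumably why the reported IQR reduction is computed within measure families.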