AI Navigate

Auto Researching, not hyperparameter tuning: Convergence Analysis of 10,000 Experiments

arXiv cs.LG / 3/18/2026

📰 News | Ideas & Deep Analysis | Models & Research

Key Points

  • The paper investigates whether LLM agents autonomously perform genuine architecture search or merely tune hyperparameters within a narrow region of the design space, analyzing 10,469 experiments by Claude Opus and Gemini 2.5 Pro across 108,000 discrete cells for dashcam collision detection over 27 days.
  • An ANOVA decomposition shows architectural choices explain 94% of performance variance (F = 1324, eta^2 = 0.94), while hyperparameter variation within a fixed architecture accounts for only about 6%.
  • Cross-task validation on a second collision dataset confirms the architecture-discovery result (75% architecture-explained variance) and identifies a different winning backbone, supporting genuine architecture search by the agents.
  • The agents discover that V-JEPA2 video features with Zipformer temporal encoders achieve 0.9245 AP, a configuration no human proposed. At N = 50, LLM-guided search reaches AP = 0.985 versus 0.965 for from-scratch random search.
  • Post-bugfix convergence follows a power law (c = 0.11, R^2 = 0.93); the small exponent indicates the cost of broad exploration rather than inefficiency, and the work introduces a large-scale empirical framework for LLM-guided combinatorial ML experimentation, using entropy cycles and Jensen-Shannon specialization to characterize multi-agent search dynamics.
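The variance split reported in the key points is an effect size from a one-way ANOVA: eta^2 is the between-group sum of squares divided by the total sum of squares. The sketch below shows that calculation on toy data. It is not the paper's analysis code; the function name `anova_effect_size`, the architecture labels, and the AP values are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def anova_effect_size(groups, scores):
    """Return (eta_squared, f_statistic) for a one-way ANOVA.

    groups: one label per experiment (here: a hypothetical architecture id)
    scores: one metric value per experiment (here: a hypothetical AP)
    """
    by_group = defaultdict(list)
    for label, score in zip(groups, scores):
        by_group[label].append(score)

    n = len(scores)
    k = len(by_group)
    grand_mean = sum(scores) / n

    # Variance explained by which group (architecture) an experiment is in.
    ss_between = sum(
        len(vals) * (sum(vals) / len(vals) - grand_mean) ** 2
        for vals in by_group.values()
    )
    # Residual variance inside each group (e.g. hyperparameter variation).
    ss_within = sum(
        (v - sum(vals) / len(vals)) ** 2
        for vals in by_group.values()
        for v in vals
    )

    eta_squared = ss_between / (ss_between + ss_within)
    f_statistic = (ss_between / (k - 1)) / (ss_within / (n - k))
    return eta_squared, f_statistic


# Toy data (made up): two architectures, three hyperparameter settings each.
archs = ["A", "A", "A", "B", "B", "B"]
aps = [0.90, 0.91, 0.92, 0.70, 0.71, 0.72]
eta2, f_stat = anova_effect_size(archs, aps)
print(f"eta^2 = {eta2:.3f}, F = {f_stat:.1f}")
```

In this toy case almost all of the variance sits between the two architectures, so eta^2 is close to 1. A value of 0.94, as reported above, would mean that knowing the architecture alone accounts for 94% of the spread in the metric.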

Abstract

When LLM agents autonomously design ML experiments, do they perform genuine architecture search, or do they default to hyperparameter tuning within a narrow region of the design space? We answer this question by analyzing 10,469 experiments executed by two LLM agents (Claude Opus and Gemini 2.5 Pro) across a combinatorial configuration space of 108,000 discrete cells for dashcam collision detection over 27 days. Through ANOVA decomposition, we find that architectural choices explain 94% of performance variance (F = 1324, eta^2 = 0.94), while hyperparameter variation within a fixed architecture explains only 6%. Cross-task validation on a second collision dataset confirms this finding (75% architecture-explained variance) with a different winning backbone, confirming genuine architecture discovery. The agents' key contribution is discovering that V-JEPA 2 video features with Zipformer temporal encoders achieve 0.9245 AP (a configuration no human proposed) and concentrating search on productive architectural regions: at N = 50, LLM-guided search reaches AP = 0.985 versus 0.965 for from-scratch random search. Post-bugfix convergence follows a power law (c = 0.11, R^2 = 0.93); the low exponent reflects the cost of broad exploration, not inefficiency, since the LLM discovers qualitatively better regions than random or Bayesian baselines. We characterize multi-agent search dynamics via entropy cycles and Jensen-Shannon specialization, providing the first large-scale empirical framework for LLM-guided combinatorial ML experiment design.
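The abstract does not spell out how entropy and Jensen-Shannon specialization are computed. The sketch below gives the standard definitions under one assumption: that each agent's search behavior is summarized as a probability distribution over architecture choices. The distributions `agent_a` and `agent_b` are made up, and the paper's actual metric may be defined differently.

```python
import math


def entropy(p):
    """Shannon entropy in bits of a discrete distribution p."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits between distributions p and q.

    Computed as H(M) - (H(P) + H(Q)) / 2 with M = (P + Q) / 2.
    With base-2 logs it ranges from 0 (identical) to 1 (disjoint support).
    """
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return entropy(m) - (entropy(p) + entropy(q)) / 2


# Hypothetical share of experiments each agent spent on four backbones.
agent_a = [0.70, 0.20, 0.05, 0.05]
agent_b = [0.10, 0.10, 0.40, 0.40]

print(f"H(agent_a) = {entropy(agent_a):.3f} bits")
print(f"H(agent_b) = {entropy(agent_b):.3f} bits")
print(f"JSD(a, b)  = {js_divergence(agent_a, agent_b):.3f} bits")
```

Under this reading, low entropy means an agent is concentrating on a few architectures, and a high Jensen-Shannon divergence between agents means they have specialized in different parts of the space.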