Auto Researching, not hyperparameter tuning: Convergence Analysis of 10,000 Experiments

arXiv cs.LG / 3/18/2026

Key Points

The paper investigates whether LLM agents autonomously perform genuine architecture search or merely tune hyperparameters within a narrow region of the design space, analyzing 10,469 experiments by Claude Opus and Gemini 2.5 Pro across 108,000 discrete cells for dashcam collision detection over 27 days.
An ANOVA decomposition shows architectural choices explain 94% of performance variance (F = 1324, eta^2 = 0.94), while hyperparameter variation within a fixed architecture accounts for only about 6%.
Cross-task validation on a second collision dataset confirms the architecture-discovery result (75% architecture-explained variance) and identifies a different winning backbone, supporting genuine architecture search by the agents.
The agents discover that V-JEPA 2 video features with Zipformer temporal encoders achieve 0.9245 AP, a configuration no human proposed. At N = 50, LLM-guided search reaches AP = 0.985 versus 0.965 for from-scratch random search.
Post-bugfix convergence follows a power law (c = 0.11, R^2 = 0.93); the small exponent reflects the cost of broad exploration rather than inefficiency. The work introduces a large-scale empirical framework for LLM-guided combinatorial ML experimentation and uses entropy cycles and Jensen-Shannon specialization to characterize multi-agent search dynamics.
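To make the variance-decomposition claim concrete, here is a minimal sketch of how eta^2 and the F statistic fall out of a one-way ANOVA when experiments are grouped by architecture. It is not the paper's analysis code: the function name `one_way_anova`, the architecture labels, and the AP values are hypothetical toy data chosen for illustration.

```python
from collections import defaultdict


def one_way_anova(groups, scores):
    """Return (eta_squared, f_statistic) for scores grouped by label.

    eta_squared = SS_between / SS_total is the share of variance
    explained by group membership (here: the architecture).
    """
    by_group = defaultdict(list)
    for label, score in zip(groups, scores):
        by_group[label].append(score)

    n = len(scores)
    k = len(by_group)
    grand_mean = sum(scores) / n

    # Total sum of squares around the grand mean.
    ss_total = sum((s - grand_mean) ** 2 for s in scores)
    # Between-group sum of squares: group means around the grand mean.
    ss_between = sum(
        len(vals) * (sum(vals) / len(vals) - grand_mean) ** 2
        for vals in by_group.values()
    )
    # Within-group sum of squares: scores around their own group mean.
    ss_within = sum(
        (s - sum(vals) / len(vals)) ** 2
        for vals in by_group.values()
        for s in vals
    )

    eta_squared = ss_between / ss_total
    f_statistic = (ss_between / (k - 1)) / (ss_within / (n - k))
    return eta_squared, f_statistic


# Hypothetical toy data: two architectures, three hyperparameter settings each.
architectures = ["arch_a", "arch_a", "arch_a", "arch_b", "arch_b", "arch_b"]
average_precision = [0.90, 0.92, 0.91, 0.60, 0.62, 0.61]

eta2, f_stat = one_way_anova(architectures, average_precision)
print(f"eta^2 = {eta2:.4f}, F = {f_stat:.1f}")
```

In this toy setup almost all of the spread in AP sits between the two architectures, so eta^2 is close to 1; the remainder (1 - eta^2) is the within-architecture share that the summary attributes to hyperparameter variation.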

Abstract

When LLM agents autonomously design ML experiments, do they perform genuine architecture search, or do they default to hyperparameter tuning within a narrow region of the design space? We answer this question by analyzing 10,469 experiments executed by two LLM agents (Claude Opus and Gemini 2.5 Pro) across a combinatorial configuration space of 108,000 discrete cells for dashcam collision detection over 27 days. Through ANOVA decomposition, we find that architectural choices explain 94% of performance variance (F = 1324, eta^2 = 0.94), while hyperparameter variation within a fixed architecture explains only 6%. Cross-task validation on a second collision dataset confirms this finding (75% architecture-explained variance) with a different winning backbone, confirming genuine architecture discovery. The agents' key contribution is discovering that V-JEPA 2 video features with Zipformer temporal encoders achieve 0.9245 AP, a configuration no human proposed, and concentrating search on productive architectural regions: at N = 50, LLM-guided search reaches AP = 0.985 versus 0.965 for from-scratch random search. Post-bugfix convergence follows a power law (c = 0.11, R^2 = 0.93); the low exponent reflects the cost of broad exploration, not inefficiency, since the LLM discovers qualitatively better regions than random or Bayesian baselines. We characterize multi-agent search dynamics via entropy cycles and Jensen-Shannon specialization, providing the first large-scale empirical framework for LLM-guided combinatorial ML experiment design.
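As a rough illustration of the Jensen-Shannon idea, the sketch below computes the Jensen-Shannon divergence (base 2) between the architecture-choice distributions of two agents. The abstract does not define its "specialization" measure precisely, so treating it as a plain JS divergence over backbone frequencies is an assumption, and the agent logs and backbone labels here are hypothetical.

```python
import math
from collections import Counter


def distribution(choices, support):
    """Empirical probability of each option in `support` among `choices`."""
    counts = Counter(choices)
    total = len(choices)
    return [counts[option] / total for option in support]


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint."""
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)


# Hypothetical logs of which temporal encoder each agent tried per experiment.
support = ["encoder_x", "encoder_y", "encoder_z"]
agent_a = ["encoder_x"] * 6 + ["encoder_y"] * 3 + ["encoder_z"] * 1
agent_b = ["encoder_x"] * 1 + ["encoder_y"] * 3 + ["encoder_z"] * 6

p = distribution(agent_a, support)
q = distribution(agent_b, support)
print(f"JS divergence = {js_divergence(p, q):.3f} bits")
```

A value near 0 would mean both agents explore the same architectures in the same proportions; a value near 1 would mean they have specialized on disjoint parts of the design space.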
