Can LLMs Beat Classical Hyperparameter Optimization Algorithms? A Study on autoresearch

arXiv cs.LG / 3/27/2026

💬 Opinion · Ideas & Deep Analysis · Tools & Practical Usage · Models & Research

Key Points

  • The paper introduces “autoresearch,” an LLM-agent approach that optimizes hyperparameters by directly editing training source code in an unconstrained search space, and uses it as a testbed against classical HPO methods.
  • Under a fixed, constrained hyperparameter search space, classical algorithms like CMA-ES and TPE consistently outperform LLM-based agents for tuning a small language model.
  • In the unconstrained setting, LLM-based code editing substantially narrows the performance gap, and the study finds that avoiding out-of-memory failures is more important than maximizing search diversity.
  • The authors argue that small/mid-sized LLMs struggle to maintain optimization state across trials, while classical HPO methods lack domain knowledge, motivating a hybrid solution.
  • They propose “Centaur,” a hybrid that shares CMA-ES's internal state (mean vector, step size, covariance matrix) with an LLM. Centaur achieves the best results overall, with its 0.8B variant outperforming its 27B variant, and scaling to 27B shows no advantage for fixed-search-space methods with the tested open-weight models.

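To make the fixed-space setting in the second bullet concrete, here is a minimal sketch of a bounded hyperparameter search with a fixed trial budget. Plain random search stands in for CMA-ES/TPE, and `toy_objective` is an invented stand-in for the small-LM validation loss; the paper's actual search space, budget, and objective are not reproduced here.

```python
import math
import random

# Hypothetical fixed search space: every trial must draw values from these
# bounds, mirroring the constrained setting where CMA-ES and TPE operate.
SEARCH_SPACE = {
    "learning_rate": (1e-5, 1e-2),      # sampled log-uniformly
    "batch_size":    [16, 32, 64, 128], # categorical choices
    "warmup_steps":  (0, 1000),         # integer range
}

def sample_config(rng: random.Random) -> dict:
    """Draw one configuration from the fixed space."""
    lo, hi = SEARCH_SPACE["learning_rate"]
    lr = math.exp(rng.uniform(math.log(lo), math.log(hi)))
    return {
        "learning_rate": lr,
        "batch_size": rng.choice(SEARCH_SPACE["batch_size"]),
        "warmup_steps": rng.randint(*SEARCH_SPACE["warmup_steps"]),
    }

def toy_objective(cfg: dict) -> float:
    """Toy stand-in for the validation loss of a small language model."""
    return (math.log10(cfg["learning_rate"]) + 3.0) ** 2 \
        + 0.001 * cfg["warmup_steps"]

def random_search(budget: int, seed: int = 0) -> tuple[dict, float]:
    """Fixed-budget search: evaluate `budget` configs, keep the best."""
    rng = random.Random(seed)
    best_cfg, best_loss = None, float("inf")
    for _ in range(budget):
        cfg = sample_config(rng)
        loss = toy_objective(cfg)
        if loss < best_loss:
            best_cfg, best_loss = cfg, loss
    return best_cfg, best_loss

best, loss = random_search(budget=50)
print(f"best config: {best}, loss={loss:.3f}")
```

A classical optimizer such as TPE or CMA-ES replaces the uniform sampling above with a model of which regions look promising, but the constraint is the same: every trial must stay inside `SEARCH_SPACE`, which is exactly what the LLM code-editing agent is allowed to break out of.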
Abstract

The autoresearch repository enables an LLM agent to search for optimal hyperparameter configurations on an unconstrained search space by editing the training code directly. Given a fixed compute budget and constraints, we use autoresearch as a testbed to compare classical hyperparameter optimization (HPO) algorithms against LLM-based methods on tuning the hyperparameters of a small language model. Within a fixed hyperparameter search space, classical HPO methods such as CMA-ES and TPE consistently outperform LLM-based agents. However, an LLM agent that directly edits training source code in an unconstrained search space narrows the gap to classical methods substantially despite using only a self-hosted open-weight 27B model. Methods that avoid out-of-memory failures outperform those with higher search diversity, suggesting that reliability matters more than exploration breadth. While small and mid-sized LLMs struggle to track optimization state across trials, classical methods lack domain knowledge. To bridge this gap, we introduce Centaur, a hybrid that shares CMA-ES's internal state, including mean vector, step-size, and covariance matrix, with an LLM. Centaur achieves the best result in our experiments, with its 0.8B variant outperforming the 27B variant, suggesting that a cheap LLM suffices when paired with a strong classical optimizer. The 0.8B model is insufficient for unconstrained code editing but sufficient for hybrid optimization, while scaling to 27B provides no advantage for fixed search space methods with the open-weight models tested. Code is available at https://github.com/ferreirafabio/autoresearch-automl.
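The state sharing at the heart of Centaur can be pictured as serializing the optimizer's internal state into text the LLM can condition on. The sketch below is a hypothetical illustration, not the paper's actual interface: `OptimizerState` and `format_state_for_llm` are invented names, the prompt format is made up, and plain Python lists stand in for whatever representation the authors use.

```python
from dataclasses import dataclass

@dataclass
class OptimizerState:
    """CMA-ES internal state, as listed in the abstract."""
    mean: list[float]               # search-distribution mean vector
    step_size: float                # global step size (sigma)
    covariance: list[list[float]]   # covariance matrix of the distribution

def format_state_for_llm(state: OptimizerState,
                         param_names: list[str]) -> str:
    """Render the optimizer state as text for the LLM's context window."""
    lines = ["Current CMA-ES state:"]
    for name, m in zip(param_names, state.mean):
        lines.append(f"- {name}: mean={m:.4f}")
    lines.append(f"- step size (sigma): {state.step_size:.4f}")
    diag = ", ".join(f"{state.covariance[i][i]:.4f}"
                     for i in range(len(state.mean)))
    lines.append(f"- covariance diagonal: {diag}")
    return "\n".join(lines)

# Example: a 2-parameter search over log learning rate and dropout.
state = OptimizerState(
    mean=[-3.2, 0.5],
    step_size=0.3,
    covariance=[[0.04, 0.0], [0.0, 0.09]],
)
prompt = format_state_for_llm(state, ["log_learning_rate", "dropout"])
print(prompt)
```

The design intuition this illustrates: the classical optimizer keeps the reliable, numerically consistent memory of the search (the part small LLMs struggle to maintain across trials), while the LLM contributes domain knowledge on top of that state rather than having to reconstruct it.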