RoboPhD: Evolving Diverse Complex Agents Under Tight Evaluation Budgets

arXiv cs.AI / 4/7/2026


Key Points

  • The paper addresses how to choose the best optimization algorithm for LLM-guided agent evolution when evaluations are costly, using a fixed budget of 1,500 evaluations.
  • It presents the first systematic comparison of three paradigms—RoboPhD’s Elo tournament selection, GEPA’s Pareto-based selection, and Autoresearch’s greedy hill-climbing—across four benchmarks (abstract reasoning, cloud scheduling, SQL generation, and financial QA).
  • RoboPhD’s key contribution is “validation-free evolution,” using Elo competition on training data to both assess agent quality and drive the evolutionary process within the same budget.
  • Using a single default configuration, RoboPhD outperforms GEPA and Autoresearch on three of four tasks, with a notable ARC-AGI improvement from 27.8% to 65.8% using Gemini 3.1 Flash Lite.
  • The authors release RoboPhD as an MIT-licensed toolkit with an optimize_anything() API to evolve diverse complex agents via self-instrumenting diagnostic code growth.
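To make the comparison concrete, here is a minimal sketch of the greedy hill-climbing paradigm the paper attributes to Autoresearch, run under a fixed evaluation budget like the 1,500-evaluation cap above. The "agent" is just an integer parameter and `score()` a toy objective; real systems mutate prompts and code via an LLM. All names and numbers here are illustrative, not from the paper's implementation.

```python
import random

random.seed(0)

BUDGET = 1500  # fixed number of evaluations, mirroring the paper's setup

def score(agent):
    """Toy objective: higher is better, with a peak at agent == 100."""
    return -abs(agent - 100)

def hill_climb(seed_agent, budget):
    """Greedy hill-climbing: keep a candidate only if it strictly improves."""
    best, best_score = seed_agent, score(seed_agent)
    evals = 1
    while evals < budget:
        candidate = best + random.choice([-3, -1, 1, 3])  # stand-in "mutation"
        cand_score = score(candidate)
        evals += 1
        if cand_score > best_score:  # greedy acceptance rule
            best, best_score = candidate, cand_score
    return best, best_score

best, s = hill_climb(seed_agent=0, budget=BUDGET)
```

The greedy rule spends every evaluation on a single lineage, which is exactly the trade-off the paper probes: it can win on simple tasks (as Autoresearch does on the easiest benchmark) but offers no mechanism for maintaining diverse candidates the way tournament or Pareto selection does.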

Abstract

2026 has brought an explosion of interest in LLM-guided evolution of agentic artifacts, with systems like GEPA and Autoresearch demonstrating that LLMs can iteratively improve prompts, code, and agent architectures across diverse domains. As adoption accelerates, a central question emerges: given the same information, the same seed agent, and the same objective, which optimization algorithm yields the best results under the same evaluation budget? This question becomes critical when evaluations are expensive, such as when they require human judgment or multiple LLM calls. We present the first systematic comparison of three optimization paradigms -- Elo tournament selection (RoboPhD), Pareto-based selection (GEPA), and greedy hill-climbing (Autoresearch) -- across four benchmarks spanning abstract reasoning, cloud scheduling, SQL generation, and financial QA, all under a fixed budget of 1,500 evaluations. RoboPhD introduces validation-free evolution: instead of splitting the budget between training and validation, it uses Elo competition on training data to simultaneously evaluate agents and drive evolution. All three systems receive seed agents with diagnostic print() statements that evolution can grow, enabling self-instrumenting agents that develop increasingly informative diagnostics for the benefit of their evolutionary successors. Using a single default configuration, RoboPhD outperforms both GEPA and Autoresearch on three of four benchmarks, losing only on the simplest task, where the winning solution (from our Autoresearch adaptation) required under 90 lines of code. On ARC-AGI, RoboPhD evolves a 22-line seed agent into a 1,013-line multi-strategy system, improving accuracy from 27.8% to 65.8% using Gemini 3.1 Flash Lite as the solver. We release RoboPhD as a versatile toolkit under the MIT license with a simple optimize_anything() API for evolving diverse complex agents.
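The "validation-free evolution" idea rests on standard Elo updates: each pairwise comparison of two agents on a training task both scores the agents and supplies the selection signal, so no budget is reserved for a held-out validation split. The sketch below shows the textbook Elo update; the K-factor of 32 and the function names are assumptions for illustration, not RoboPhD's actual internals.

```python
K = 32  # conventional Elo K-factor; the paper does not specify RoboPhD's value

def expected_score(r_a, r_b):
    """Probability that agent A beats agent B, given their current ratings."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(r_a, r_b, a_won):
    """Return updated (r_a, r_b) after one head-to-head comparison."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    r_a_new = r_a + K * (s_a - e_a)
    r_b_new = r_b + K * ((1.0 - s_a) - (1.0 - e_a))
    return r_a_new, r_b_new
```

Because every comparison updates ratings in zero-sum fashion, the same 1,500 evaluations that rank agents also decide which lineages are selected to evolve further, which is what lets RoboPhD avoid splitting the budget between training and validation.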