RoboPhD: Evolving Diverse Complex Agents Under Tight Evaluation Budgets
arXiv cs.AI / 4/7/2026
Key Points
- The paper addresses how to choose the best optimization algorithm for LLM-guided agent evolution when evaluations are costly, using a fixed budget of 1,500 evaluations.
- It presents the first systematic comparison of three paradigms—RoboPhD’s Elo tournament selection, GEPA’s Pareto-based selection, and Autoresearch’s greedy hill-climbing—across four benchmarks (abstract reasoning, cloud scheduling, SQL generation, and financial QA).
- RoboPhD’s key contribution is “validation-free evolution,” using Elo competition on training data to both assess agent quality and drive the evolutionary process within the same budget.
- Using a single default configuration across all benchmarks, RoboPhD outperforms GEPA and Autoresearch on three of the four tasks, including a notable ARC-AGI improvement from 27.8% to 65.8% with Gemini 3.1 Flash Lite.
- The authors release RoboPhD as an MIT-licensed toolkit with an optimize_anything() API to evolve diverse complex agents via self-instrumenting diagnostic code growth.
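The paper's exact tournament mechanics aren't detailed here, but the core idea of Elo-based selection on training data can be sketched. The snippet below is a minimal illustration, not RoboPhD's implementation: agents play pairwise matches on training examples, standard Elo updates rank them, and the top-rated agent becomes the parent for the next mutation round. The `agents` dictionary and the win probabilities are hypothetical placeholders.

```python
def elo_update(r_a, r_b, score_a, k=32.0):
    """Standard Elo update after one match.

    score_a is 1.0 if agent A wins, 0.0 if it loses, 0.5 for a draw.
    Returns the updated (r_a, r_b) ratings.
    """
    expected_a = 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new


def run_tournament(agents, matches):
    """Rank agents by Elo from a list of (name_a, name_b, score_a) matches.

    `agents` maps agent name -> initial rating; matches are outcomes of
    head-to-head comparisons on shared training examples.
    """
    ratings = dict(agents)
    for a, b, score_a in matches:
        ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], score_a)
    # Highest-rated agent would seed the next evolutionary generation.
    return sorted(ratings, key=ratings.get, reverse=True)


# Hypothetical example: three candidate agents, all starting at 1000 Elo.
ranking = run_tournament(
    {"agent_v1": 1000.0, "agent_v2": 1000.0, "agent_v3": 1000.0},
    [("agent_v2", "agent_v1", 1.0),   # v2 beats v1 on a training example
     ("agent_v2", "agent_v3", 1.0),   # v2 beats v3
     ("agent_v3", "agent_v1", 0.5)],  # v3 draws with v1
)
```

Because each match consumes training-set evaluations that the tournament would spend anyway, the same budget both scores agents and selects parents, which is the sense in which the evolution is "validation-free."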



