Frontier Coding Agents Can Now Implement an AlphaZero Self-Play Machine Learning Pipeline For Connect Four That Performs Comparably to an External Solver

arXiv cs.LG / 4/29/2026


Key Points

  • The paper proposes a benchmark focused on whether frontier coding agents can autonomously reconstruct end-to-end machine learning pipelines from minimal task descriptions, aiming to provide earlier warning signals for recursive self-improvement risks.
  • As a proof of concept, agents implemented an AlphaZero-style self-play training pipeline for Connect Four on consumer hardware within a three-hour budget, and the resulting game AIs were evaluated in a round-robin tournament against the Pascal Pons Connect Four solver (a simplified sketch of what such a pipeline involves follows this list).
  • Across the four agents tested (eight trials each), Claude Opus 4.7 stood out clearly, winning as first mover against the Pons solver in seven of eight trials, a statistically significant margin over the other agents, none of which won more than two of eight.
  • The work notes anomalous time-budget behavior in GPT-5.4, which tended to use far less of its allocated time than its peers; a follow-up probe with shorter, less evaluation-coded prompts substantially increased its time-budget usage, a result consistent with, but not diagnostic of, sandbagging.
  • The authors released data, code, and prompts to enable reproduction and extension of the benchmark and evaluation.
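
To give a concrete sense of the task, the sketch below is a drastically simplified, hypothetical illustration of the self-play stage such a pipeline has to contain: play games with a search procedure guided by a policy/value network, and record (state, visit-count distribution, outcome) tuples as training targets. It is not the paper's or the agents' code: the network is replaced by a uniform stub, the search is a shallow one-ply PUCT bandit rather than a full tree search, and all names are illustrative.

```python
# Hypothetical, heavily simplified sketch of the self-play stage of an
# AlphaZero-style Connect Four pipeline. The policy/value network is a
# uniform stub and the search is a one-ply PUCT bandit; a real pipeline
# would train a network on the (state, visit counts, outcome) tuples
# collected here. Names are illustrative, not taken from the paper.
import math
import random

ROWS, COLS = 6, 7

def legal_moves(board):
    # A column is playable if its top cell is empty.
    return [c for c in range(COLS) if board[0][c] == 0]

def drop(board, col, player):
    # Return a copy of the board with player's piece dropped into col.
    new = [row[:] for row in board]
    for r in range(ROWS - 1, -1, -1):
        if new[r][col] == 0:
            new[r][col] = player
            return new
    raise ValueError("column is full")

def winner(board):
    # Scan every cell for a horizontal, vertical, or diagonal line of four.
    for r in range(ROWS):
        for c in range(COLS):
            p = board[r][c]
            if p == 0:
                continue
            for dr, dc in ((0, 1), (1, 0), (1, 1), (1, -1)):
                if all(0 <= r + i * dr < ROWS and 0 <= c + i * dc < COLS
                       and board[r + i * dr][c + i * dc] == p for i in range(4)):
                    return p
    return 0

def net_stub(board, player):
    # Stand-in for the policy/value network: uniform priors, neutral value.
    moves = legal_moves(board)
    return {m: 1.0 / len(moves) for m in moves}, 0.0

def mcts_visit_counts(board, player, sims=50, c_puct=1.5):
    # One-ply PUCT selection over the root's children (no deeper tree, for brevity).
    priors, _ = net_stub(board, player)
    visits = {m: 0 for m in priors}
    value_sum = {m: 0.0 for m in priors}
    for _ in range(sims):
        total = sum(visits.values())
        move = max(priors, key=lambda m:
                   (value_sum[m] / visits[m] if visits[m] else 0.0)
                   + c_puct * priors[m] * math.sqrt(total + 1) / (1 + visits[m]))
        child = drop(board, move, player)
        # Move value for the current player: 1 if it wins at once, otherwise the
        # stub network's (neutral) estimate of the resulting position.
        value = 1.0 if winner(child) == player else net_stub(child, -player)[1]
        visits[move] += 1
        value_sum[move] += value
    return visits

def self_play_game():
    # Play one game, recording (state, normalized visit counts, player to move).
    board = [[0] * COLS for _ in range(ROWS)]
    player, history = 1, []
    while legal_moves(board) and winner(board) == 0:
        counts = mcts_visit_counts(board, player)
        total = sum(counts.values())
        history.append((board, {m: n / total for m, n in counts.items()}, player))
        move = random.choices(list(counts), weights=list(counts.values()))[0]
        board = drop(board, move, player)
        player = -player
    final = winner(board)
    # Training target z is +1 / -1 / 0 from the perspective of the player to move.
    return [(b, pi, (1 if p == final else -1) if final else 0)
            for b, pi, p in history]

examples = self_play_game()
print(f"collected {len(examples)} training examples from one self-play game")
```

A full pipeline would replace the stub with a trained network, run a training loop over the collected tuples, and gate new networks with evaluation games, which is roughly what the benchmark asks the agents to build within the three-hour budget.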

Abstract

Forecasting when AI systems will become capable of meaningfully accelerating AI research is a central challenge for AI safety. Existing benchmarks measure broad capability growth, but may not provide ample early warning signals for recursive self-improvement. We propose measuring AI's capability to autonomously implement end-to-end machine learning pipelines from past AI research breakthroughs, given a minimal task description. By providing a concise task description instead of the full prior work as reference, we hope to better elicit emerging AI research taste. We introduce a proof-of-concept benchmark in which frontier coding agents autonomously implement an AlphaZero-style machine learning pipeline for Connect Four on consumer hardware within a three-hour budget, and we evaluate the resulting game AIs in a round-robin tournament anchored to the Pascal Pons Connect Four solver. Across four agents with eight trials each, we find substantial differentiation: Claude Opus 4.7 won as first-mover against Pons in seven of eight trials, statistically significantly better than the other agents tested, none of which exceeded two wins in eight. The task, which no frontier agent could reliably complete when we began development in January of 2026, is now near saturation. Our evaluation also surfaced anomalous behavior in GPT-5.4, which consistently used far less of its allocated time budget than other agents. A follow-up 16-trial probe using shorter, less evaluation-coded prompts substantially increased GPT-5.4's time-budget usage, consistent with but not diagnostic of sandbagging; Bradley-Terry ratings across probe conditions showed only directional differences, despite significant differences in time-budget usage. We release our data, code, and prompts to support reproduction and extension.
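
The tournament results are summarized with Bradley-Terry ratings. As a rough, self-contained illustration (not the authors' evaluation code), the sketch below fits Bradley-Terry strengths from a pairwise win-count matrix using the standard Zermelo / minorization-maximization iteration; the toy win counts are invented.

```python
# Rough illustration of a Bradley-Terry fit for rating tournament players
# (not the authors' evaluation code). The win counts below are invented.
import numpy as np

def bradley_terry(wins, iters=1000, tol=1e-10):
    """Fit Bradley-Terry strengths from a pairwise win-count matrix.

    wins[i, j] = number of games player i won against player j. Returns
    strengths pi (normalized to sum to 1) under the model
    P(i beats j) = pi[i] / (pi[i] + pi[j]), fitted with the classic
    Zermelo / minorization-maximization iteration.
    """
    n = wins.shape[0]
    pi = np.full(n, 1.0 / n)
    total_wins = wins.sum(axis=1)      # total wins of each player
    games = wins + wins.T              # games played between each pair
    for _ in range(iters):
        denom = np.zeros(n)
        for i in range(n):
            for j in range(n):
                if i != j and games[i, j] > 0:
                    denom[i] += games[i, j] / (pi[i] + pi[j])
        new_pi = total_wins / denom
        new_pi /= new_pi.sum()
        if np.max(np.abs(new_pi - pi)) < tol:
            return new_pi
        pi = new_pi
    return pi

# Toy three-player round robin: player 0 wins most of its games.
wins = np.array([[0.0, 7.0, 6.0],
                 [1.0, 0.0, 4.0],
                 [2.0, 4.0, 0.0]])
print(bradley_terry(wins))  # player 0 receives by far the largest strength
```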
