
How I Built Eval Tools for Karpathy's Autoresearch

Dev.to / 3/18/2026

💬 Opinion · Tools & Practical Usage · Models & Research

Key Points

  • The author built three CLIs—autojudge, autosteer, and autoevolve—to quantify autoresearch results and identify which experiments actually matter.
  • autojudge reads results.tsv and run.log, estimates a rolling noise floor, and reports a verdict with a confidence score based on Pareto efficiency of val_bpb versus memory.
  • autosteer analyzes the history of kept versus discarded experiments, groups them by category (architecture, hyperparams, optimizer, regularization), and suggests what to try next in exploit or explore modes.
  • The article describes the original problem of thousands of experiments with no reliable signal and shows how these tools provide a more robust evaluation framework.
  • Exit codes are scripting-friendly (0 keep, 1 discard, 2 retest) to enable seamless piping into experiment loops.

TL;DR: Karpathy's autoresearch runs hundreds of GPT pretraining experiments overnight. It doesn't tell you which ones mattered. I built three CLIs that do: autojudge (noise floor + Pareto analysis), autosteer (what to try next), autoevolve (competing agents, cross-pollinate winners).

The problem

After running autoresearch for a week I had a TSV with thousands of rows and no idea what to trust.

The built-in keep/discard logic is: did val_bpb go down? That's it. No noise floor estimation. No way to know if a 0.02% improvement is real signal or run-to-run jitter. After 700 experiments I had 6 "improvements" and zero confidence in any of them.

The eval layer isn't there. Karpathy left it as an exercise.

What I built

autojudge

Reads results.tsv and run.log, estimates the noise floor from recent experiments, checks if the improvement is on the Pareto front (val_bpb vs memory), and returns a verdict with a confidence score.

pip install autojudge
autojudge --results results.tsv --run run.log

Output looks like:

experiment_042: STRONG_KEEP (confidence: 0.91)
  val_bpb delta: -0.0041 | noise floor: ±0.0008
  pareto status: EFFICIENT

experiment_043: RETEST (confidence: 0.44)
  val_bpb delta: -0.0009 | noise floor: ±0.0011
  delta within noise -> not enough signal
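The Pareto status in the output above is easy to reason about: a config is efficient if no other run beats it on both axes at once. Here is a minimal sketch of that check; this is my reconstruction of the idea, not autojudge's actual code.

```python
def is_pareto_efficient(candidate, history):
    """Return True if no run in `history` dominates `candidate`.

    Each point is a (val_bpb, memory_gb) tuple; lower is better on
    both axes. A point dominates if it is <= on both axes and
    strictly < on at least one.
    """
    vb, mem = candidate
    for other_vb, other_mem in history:
        if (other_vb <= vb and other_mem <= mem
                and (other_vb < vb or other_mem < mem)):
            return False
    return True

runs = [(1.042, 8.1), (1.038, 9.4), (1.051, 7.2)]
print(is_pareto_efficient((1.036, 8.0), runs))  # True: nothing dominates it
print(is_pareto_efficient((1.045, 9.8), runs))  # False: (1.042, 8.1) dominates
```

The quadratic scan is fine at this scale; a few hundred experiments is nothing.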

Exit codes are scripting-friendly: 0 = keep, 1 = discard, 2 = retest. You can pipe directly into your loop.
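Dispatching on those exit codes is just a returncode check. The sketch below substitutes a dummy subprocess for the real `autojudge` invocation (which I can't assume is installed here); only the 0/1/2 mapping is taken from the article.

```python
import subprocess
import sys

KEEP, DISCARD, RETEST = 0, 1, 2

def judge(cmd):
    """Run a judging command and map its exit code to an action."""
    result = subprocess.run(cmd)
    return {KEEP: "keep", DISCARD: "discard", RETEST: "retest"}.get(
        result.returncode, "error")

# Stand-in for `autojudge --results results.tsv --run run.log`:
# a trivial process that exits with code 2 ("retest").
stub = [sys.executable, "-c", "raise SystemExit(2)"]
print(judge(stub))  # retest
```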

What didn't work at first: I tried estimating the noise floor from a single baseline run. It's too noisy on its own. I needed a rolling window of recent experiments (I settled on the last 5) to get a stable estimate.
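A minimal sketch of that rolling estimate, assuming the noise floor is the standard deviation of val_bpb over the last 5 runs (my guess at a reasonable statistic; the article doesn't pin down the exact one):

```python
import statistics
from collections import deque

WINDOW = 5  # settled on the last 5 experiments

def rolling_noise_floor(val_bpbs, window=WINDOW):
    """Std-dev of the most recent `window` val_bpb readings."""
    recent = list(deque(val_bpbs, maxlen=window))
    if len(recent) < 2:
        return float("inf")  # not enough data: treat everything as noise
    return statistics.stdev(recent)

history = [1.0412, 1.0405, 1.0409, 1.0401, 1.0415, 1.0408]
floor = rolling_noise_floor(history)
delta = -0.0003
verdict = "RETEST" if abs(delta) <= floor else "KEEP" if delta < 0 else "DISCARD"
print(f"noise floor ±{floor:.4f} -> {verdict}")
```

Early in a run, `float("inf")` forces everything to RETEST, which is the conservative behavior you want before the window fills.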

autosteer

Looks at your history of kept/discarded experiments, groups them by category (architecture, hyperparams, optimizer, regularization, etc.), and suggests what to try next.

pip install autosteer
autosteer --results results.tsv --mode exploit

Two modes:

  • exploit: you're winning in a category, suggests more variations there
  • explore: you're stuck, suggests underexplored categories

Example output:

Category analysis (last 50 experiments):
  architecture:    12 tried | 8 kept (67%) | EXPLOIT
  hyperparams:     18 tried | 6 kept (33%) | NEUTRAL
  optimizer:        8 tried | 1 kept (12%) | AVOID
  regularization:   4 tried | 0 kept (0%)  | EXPLORE

Suggested next: architecture variations (high success rate)
Specific angles: attention head count, layer depth, skip connections

Caveat: suggestions are category-level, not causal. It can tell you architecture changes tend to work for your setup. It can't tell you why.
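The category scoring behind that output can be sketched as: bucket experiments by category, compute keep rates, and map rates to labels. The thresholds and the "too few tries means EXPLORE" rule here are my guesses, not autosteer's.

```python
from collections import Counter

def score_categories(experiments, min_tried=5):
    """experiments: list of (category, kept) pairs from recent history."""
    tried, kept = Counter(), Counter()
    for category, was_kept in experiments:
        tried[category] += 1
        kept[category] += was_kept
    report = {}
    for category in tried:
        rate = kept[category] / tried[category]
        if tried[category] < min_tried:
            label = "EXPLORE"   # too little data: worth probing
        elif rate >= 0.5:
            label = "EXPLOIT"
        elif rate >= 0.25:
            label = "NEUTRAL"
        else:
            label = "AVOID"
        report[category] = (tried[category], kept[category], label)
    return report

history = [("architecture", True)] * 8 + [("architecture", False)] * 4 \
        + [("optimizer", False)] * 7 + [("optimizer", True)] \
        + [("regularization", False)] * 4
print(score_categories(history))
```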

autoevolve

The experimental one. Puts multiple agents on separate git worktrees with different strategies. They compete on the same problem. Winning ideas cross-pollinate into the next generation.

pip install autoevolve
autoevolve --strategies conservative aggressive random --rounds 3

Each agent gets its own worktree and runs the standard autoresearch loop with its strategy. After each round, the best-performing config gets merged into all agents as the new baseline.

This is the least polished of the three. It works, and the git worktree management is clean, but the cross-pollination heuristic is simplistic: I'm picking the best single config per round rather than doing anything clever with ensembles. That's next.
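The round structure is easy to sketch without the git plumbing. Each strategy mutates the shared baseline, everything is evaluated, and the round's best config becomes every agent's next baseline. Toy one-dimensional objective, made-up step sizes, no worktrees:

```python
import random

def evolve(strategies, objective, baseline, rounds=3, seed=0):
    """Winner-take-all cross-pollination: each round, the best
    config (lowest objective) becomes the shared baseline."""
    rng = random.Random(seed)
    for _ in range(rounds):
        # Each strategy perturbs the shared baseline with its own step size.
        candidates = [baseline + rng.uniform(-step, step)
                      for step in strategies.values()]
        # Keeping `baseline` in the pool means a round can never regress.
        baseline = min(candidates + [baseline], key=objective)
    return baseline

# Toy setup: minimize (x - 3)^2 starting from x = 0.
strategies = {"conservative": 0.5, "aggressive": 3.0, "random": 1.5}
best = evolve(strategies, lambda x: (x - 3) ** 2, baseline=0.0)
print(best)
```

In the real tool each "candidate" is a full autoresearch run in its own worktree, so the loop body is hours, not microseconds, but the control flow is the same.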

Installation

pip install autojudge autosteer autoevolve

Python 3.10+, MIT license. Plugs into the standard autoresearch loop, reading results.tsv and run.log, with no other dependencies on autoresearch internals.
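Since results.tsv is the only contract, integrating anything else is a `csv.DictReader` with a tab delimiter. The column names below (experiment, val_bpb, memory_gb) are assumptions for illustration; use whatever headers your autoresearch run actually emits.

```python
import csv
import io

# Assumed schema for illustration; real column names may differ.
SAMPLE = ("experiment\tval_bpb\tmemory_gb\n"
          "exp_042\t1.0365\t8.0\n"
          "exp_043\t1.0401\t9.8\n")

def load_results(fh):
    """Parse a results.tsv-style stream into typed rows."""
    rows = []
    for row in csv.DictReader(fh, delimiter="\t"):
        rows.append({
            "experiment": row["experiment"],
            "val_bpb": float(row["val_bpb"]),
            "memory_gb": float(row["memory_gb"]),
        })
    return rows

results = load_results(io.StringIO(SAMPLE))
print(results[0]["val_bpb"])  # 1.0365
```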

Repo: github.com/dean0x/autolab

What I'd do differently

The noise floor estimation in autojudge took three rewrites. My first approach (single baseline) was too noisy. My second approach (fixed window of 10) was too slow to adapt early in a run. Rolling window of 5 was the right tradeoff.

If you're using autoresearch seriously, the eval layer is where the leverage is. The overnight loop is the easy part.