Been running autoresearch for about a week: ~100 experiments per night on an H100. The keep rate is around 15%, which matches what Karpathy posted in his own discussion threads (#32 and #43).

The problem isn't the keep/discard loop; that works. The problem is that some of those keeps don't hold up. Karpathy's session #43 shows that 5% warmup (a keep in session #32) actually hurt performance when run again. A 0.02% improvement in val_bpb could be a real win or could be GPU nondeterminism. After extended runs it gets worse: 68 experiments for a single keep. If you build on a false keep (change the architecture based on it, stack more experiments on top), you're compounding noise. That's worse than a clean discard.

So I built three CLIs:

- autojudge estimates the noise floor from your recent experiments, checks whether the result sits on the Pareto front (val_bpb vs memory), and returns a confidence-scored verdict: STRONG_KEEP, KEEP, MARGINAL, RETEST, DISCARD, or CRASH. MARGINAL means "this might be noise, retest before building on it." Exit codes are scripting-friendly.
- autosteer analyzes which categories of experiments (architecture, hyperparams, optimizer) have historically produced real improvements and suggests what to try next: exploit mode when you're on a streak, explore when you're stuck. Stops the random walk.
- autoevolve is more experimental. It puts multiple agents on separate git worktrees, with different strategies competing on the same problem; winning ideas get cross-pollinated.

The difference in practice: instead of waking up to a TSV and guessing which keeps are real, you wake up to ranked results with confidence scores and a clear next step.

Caveats: noise-floor estimation needs ~5 experiments to stabilize; autosteer's suggestions are category-level, not causal; autoevolve is the newest and least polished.

pip install autojudge autosteer autoevolve
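The noise-floor idea can be sketched in a few lines: take the spread of val_bpb deltas across recent runs whose changes should have been neutral, and only call a keep when the improvement clears that spread by a margin. This is a hypothetical sketch of the logic, not autojudge's actual implementation; the threshold multipliers and function names are invented.

```python
import statistics

def noise_floor(recent_deltas, min_runs=5):
    """Estimate the noise floor as the stdev of val_bpb deltas from
    recent experiments whose changes should have been neutral.
    Needs roughly min_runs results before the estimate stabilizes."""
    if len(recent_deltas) < min_runs:
        return None  # not enough data yet
    return statistics.stdev(recent_deltas)

def verdict(improvement, floor):
    """Map a val_bpb improvement (positive = better) to a verdict,
    using invented multiples of the noise floor as thresholds."""
    if floor is None:
        return "RETEST"
    if improvement >= 3 * floor:
        return "STRONG_KEEP"
    if improvement >= 2 * floor:
        return "KEEP"
    if improvement >= floor:
        return "MARGINAL"
    if improvement > -floor:
        return "RETEST"
    return "DISCARD"

# Example: repeat-run deltas give a floor near 0.0001 bpb,
# so a 0.0002 improvement is only MARGINAL, not a KEEP.
deltas = [0.00012, -0.00008, 0.00005, -0.00011, 0.00009]
print(verdict(0.0002, noise_floor(deltas)))  # MARGINAL
```

The point of the multiplier scheme is exactly the post's complaint: a tiny improvement that sits inside one standard deviation of repeat-run noise should never be treated as a keep you build on.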
[P] Built confidence scoring for autoresearch because keeps that don't reproduce are worse than discards
Reddit r/MachineLearning / 3/18/2026
📰 News · Tools & Practical Usage
Key Points
- The post details running autoresearch at scale (about 100 experiments per night on an H100) with a keep rate around 15%, consistent with Karpathy’s observations.
- It argues that while the keep/discard loop works, some keeps don't hold up when rerun; building on a keep that was actually noise compounds error, which makes a false keep worse than a clean discard.
- It introduces three CLIs—autojudge, autosteer, and autoevolve—that estimate the noise floor, suggest next experiments, and run competing strategies across git worktrees to cross-pollinate ideas.
- It promises that you wake up to ranked results with confidence scores and a clear next step instead of raw TSV logs.
- Caveats: the noise-floor estimate stabilizes after about five experiments, autosteer's suggestions are category-level rather than causal, and autoevolve is the newest and least polished; all three install with pip install autojudge autosteer autoevolve.
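The Pareto check the summary mentions (val_bpb vs memory) can be sketched as: a result stays on the front only if no prior experiment is at least as good on both axes and strictly better on one. A minimal sketch under that assumption; the tuple layout and function name are invented, not autojudge's actual data model.

```python
def on_pareto_front(candidate, history):
    """Return True if no prior result dominates the candidate.
    Each result is (val_bpb, peak_memory_gb); lower is better on both.
    A dominating result is <= on both axes and < on at least one."""
    c_bpb, c_mem = candidate
    for bpb, mem in history:
        if bpb <= c_bpb and mem <= c_mem and (bpb < c_bpb or mem < c_mem):
            return False
    return True

history = [(0.8420, 38.0), (0.8435, 31.5), (0.8460, 29.0)]
print(on_pareto_front((0.8428, 33.0), history))  # True: trades bpb for memory
print(on_pareto_front((0.8440, 39.0), history))  # False: (0.8420, 38.0) dominates
```

Treating val_bpb and memory jointly like this is what lets a slightly-worse-bpb result still count as progress when it frees up enough memory.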