Karpathy's Autoresearch: Improving Agentic Coding Skills

Dev.to / 3/25/2026


Key Points

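  • Andrej Karpathy's publicly released autoresearch workflow is described as a mechanism that uses experiment results to run a training (or improvement) loop autonomously, improving the model (or its behavior) over time.
  • The author views agentic coding environments such as Claude Code as an "agentic harness" of multiple skills, memory, sub-agents, hooks, and so on, and proposes a framework for improving it through deterministic experiments instead of the usual rule-of-thumb evaluation.
  • To design the improvement loop, the task is split into "request → explore → plan → execute → review", and interactive parts that require user input are excluded to keep the experiments simple.
  • The core of self-improvement is an experiment that evaluates each generation (new version) by its results; to make generations comparable, outputs and measurements need to be stable and deterministic.
  • The evaluation metrics include token usage (also useful for cost optimization), execution time, and number of tool calls (to curb unnecessary overhead and permissions), with the aim of raising autonomy, parallelism, and efficiency alongside quality.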

Introduction

Recently, Andrej Karpathy made his autoresearch workflow public: https://github.com/karpathy/autoresearch. The idea is to autonomously improve a model's training process based on experiment results. Using Claude Code, you run this loop for hours or days and end up with a better model. The whole flow is described in the program.md file as a skill: https://github.com/karpathy/autoresearch/blob/master/program.md

I'm not training any LLMs for work or even as a hobby, but I do a lot of coding, now mostly with Claude Code. To generate high-quality code that consistently follows conventions and standards, I use multiple skills, memory files, sub-agents, hooks, and so on. Let's call it an agentic harness.

However, I evaluate this harness rather naively, not based on experiments or metrics - let’s say, not scientifically. The usual approach has been: test best practices that feel useful -> if they work -> incorporate them into the workflow. Or, if issues are caught during human review -> fix the workflow.

But I think I can borrow ideas from Karpathy’s autoresearch and adapt them to improve my agentic coding harness based on deterministic experiments.

Let's design a coding skill auto-improvement loop.

Solution

Assume we have a skill that implements a common workflow for daily coding:

take a request/task -> explore -> plan -> execute -> review.

For simplicity, we exclude any interactive steps that require user input. Optimizing those would require a more complex experimental framework.

The core of the autoresearch loop is an experiment that evaluates a new version (generation) based on its results. For that, we need experiments that are as deterministic as possible and metrics that are stable, so that outputs and measurements are comparable across runs and generations.

What is the goal of this skill?

To determine the right steps and provide the right context to the coding agent so that the resulting code is predictable, follows standards, and passes human review.

But code quality is not the only concern. We also want:

  • High autonomy (minimal escalation to humans)
  • Ability to run many tasks in parallel
  • Minimal token usage
  • Low execution time 

Evaluation Framework

We define a collection of test cases for the skill:

Request/Task -> Reference Code
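
A minimal sketch of how such a test case could be represented. The names `SkillCase`, `normalize`, and `matches_reference` are hypothetical, and the whitespace-only normalization is an assumption about what "matches the reference" means; a real harness would probably need something stricter or smarter (for example, running a test suite against the output).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SkillCase:
    """One evaluation case: a request and the code we expect back (hypothetical type)."""
    name: str
    request: str          # the task handed to the skill
    reference_code: str   # the expected output code


def normalize(code: str) -> str:
    # Assumption: only surrounding whitespace and trailing spaces on each
    # line are ignored; everything else must match exactly.
    lines = [line.rstrip() for line in code.strip().splitlines()]
    return "\n".join(lines)


def matches_reference(output_code: str, case: SkillCase) -> bool:
    """Return True when the generated code equals the reference after normalization."""
    return normalize(output_code) == normalize(case.reference_code)
```

A collection of test cases is then just a list of `SkillCase` objects that every generation of the skill is run against.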

Metrics:

  • Token usage (end-to-end and per step), or even cost in real money - also helps optimize model selection per step
  • Execution time (end-to-end and per step)
  • Number of tool calls (to reduce unnecessary permissions and overhead)
  • Number of errors, self-corrections, or full aborts (when the agent cannot proceed without user input)
  • Logs of issues, self-corrections, and fixes
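
The metrics above could be recorded per step and summed for the end-to-end view. This is only a sketch: `StepMetrics`, `RunMetrics`, and their field names are assumptions, not something Claude Code or autoresearch provides.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class StepMetrics:
    """Measurements for one workflow step (hypothetical record)."""
    step: str        # e.g. "explore", "plan", "execute", "review"
    tokens: int
    seconds: float
    tool_calls: int
    errors: int = 0  # errors or self-corrections observed in this step


@dataclass
class RunMetrics:
    """All steps of one run of the skill on one test case."""
    steps: list[StepMetrics] = field(default_factory=list)
    aborted: bool = False  # True when the agent could not proceed without user input

    def total(self, name: str) -> float:
        # End-to-end value of a numeric metric, summed over all steps.
        return sum(getattr(s, name) for s in self.steps)
```

Keeping per-step records is what makes it possible to ask later which step is burning tokens or time, rather than only seeing the total.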

In the original autoresearch, a single metric (val_bpb) determines whether a version advances.
For a coding skill, we need multiple key metrics:

  • Test cases passed
  • Time
  • Token usage

Other metrics act as signals for future improvements.

For simplicity at the design stage, we use a binary score:

  • 0 → output code does not match the reference
  • 1 → output code matches the reference

Each test case gives 1 point if it passes.

Additionally:

  • +1 point if execution time improves vs. previous version
  • +1 point if cost improves vs. previous version

Final score = sum of all points

Decision rule:

  • If current_score > previous_score -> advance
  • Else -> discard and revert

Since we have multiple test cases, correctness dominates the score, which is desirable. Only after maximizing quality do time and cost become deciding factors.
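
The scoring and decision rule can be written down in a few lines. This is a sketch under assumptions: the function names are hypothetical, "improves" is read as "strictly lower than the previous version", and the first generation (with no previous measurements) simply gets no bonus points.

```python
from typing import Optional, Sequence


def total_score(
    passed: Sequence[bool],
    seconds: float,
    cost: float,
    prev_seconds: Optional[float] = None,
    prev_cost: Optional[float] = None,
) -> int:
    """1 point per passing test case, +1 for faster, +1 for cheaper."""
    points = sum(1 for ok in passed if ok)
    if prev_seconds is not None and seconds < prev_seconds:
        points += 1  # execution time improved vs. previous version
    if prev_cost is not None and cost < prev_cost:
        points += 1  # cost improved vs. previous version
    return points


def should_advance(current_score: int, previous_score: int) -> bool:
    """Decision rule: advance only on a strictly higher score."""
    return current_score > previous_score
```

For example, `total_score([True, True, False], 10.0, 0.40, prev_seconds=12.0, prev_cost=0.40)` gives 3: two passing cases plus one point for the faster run. Note that with only a handful of test cases the two bonus points can offset a lost test case, so correctness dominates only when the test suite is large relative to the bonus.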

Auto-Improvement Loop

The loop is very similar to the original autoresearch. Each iteration is stateless:

  1. Take the current SKILL.md, analyze it, and apply a change based on a specific experiment idea. Boundaries are important: limit the scope of changes. We want iterative improvement, not full rewrites. At the same time, changes should not be too small, since evaluation is noisy.
  2. Run all test cases. Each test case should be executed multiple times to smooth out non-determinism.
  3. Evaluate results. Aggregate measurements. Compute the total score.
  4. Compare with the previous best version. If better -> commit as the new baseline. If worse -> discard.
  5. Repeat with a new experiment idea.
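
The steps above can be sketched as a small driver. Everything here is hypothetical: `propose` stands in for the agent editing SKILL.md, `run_once` stands in for running all test cases and computing a score, and averaging over repeats is just one assumed way to smooth out non-determinism.

```python
from statistics import mean
from typing import Callable, Tuple


def repeated(run_once: Callable[[str], float], repeats: int) -> Callable[[str], float]:
    """Wrap a noisy evaluation so that it is averaged over several runs."""
    def evaluate(skill: str) -> float:
        return mean(run_once(skill) for _ in range(repeats))
    return evaluate


def improve(
    skill: str,
    propose: Callable[[str, int], str],
    evaluate: Callable[[str], float],
    iterations: int,
) -> Tuple[str, float]:
    """Keep a candidate only when it scores strictly better than the baseline."""
    best, best_score = skill, evaluate(skill)
    for i in range(iterations):
        candidate = propose(best, i)       # step 1: apply one bounded change
        candidate_score = evaluate(candidate)  # steps 2-3: run and score
        if candidate_score > best_score:   # step 4: commit or discard
            best, best_score = candidate, candidate_score
    return best, best_score
```

In a real setup, "commit as the new baseline" would likely be a git commit of SKILL.md and "discard" a revert; the sketch keeps both in memory to stay self-contained.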

The diagram of the autoimprove loop:

[Figure: autoimprove loop diagram]

Conclusion

I designed an auto-improvement loop for coding skills based on Andrej Karpathy’s autoresearch approach, originally created for improving LLM training loops.

At a high level, nothing prevents us from applying the same idea to agentic coding. In theory, an agent could autonomously “train” its own coding skills based on specific use cases and a codebase - without human supervision.

That said, there are still many challenges:

  • Defining high-quality test cases that cover edge cases
  • Setting proper boundaries for skill modifications
  • Forcing the agent to explore the full design space (sub-agents, memory strategies, tooling, etc.)
  • Deciding when an agent should pull in new tools (CLIs, MCPs) or even build them from scratch

These challenges will likely surface during implementation and early runs. I'll share more once I have initial results and a working version.