On building mini-agents that actually work in production, and why nobody's talking about the real problem.
There's a split happening in AI that most people are missing.
Camp A builds "super agents." Autonomous systems that browse the web, write code, use dozens of tools, plan multi-step workflows. Devin, Manus, Operator. Gets the tweets, the funding, the demos.
Camp B builds "mini-agents." Narrow, single-task decision systems optimized for accuracy at volume. No tools, no browsing, no multi-step planning. One question, answered correctly, tens of thousands of times per day.
Camp B is almost entirely silent. I've spent months in Camp B, and I think the harness that produces these agents matters more than any individual agent.
The Economics
Consider a business process with 10,000 binary decisions per day. Each takes ~5 minutes of skilled labor. That's 104 full-time employees.
| | Super Agent (80%) | Mini Agent (97%) |
|---|---|---|
| Automated | 8,000/day | 9,700/day |
| Human review | ~2,000/day (slower to fix) | ~300/day (cleanly flagged) |
| Headcount | ~50 + 5 eng | ~6 |
| Errors/day | 2,000 | 300 |
| Cost/decision | ~$0.05 | ~$0.002 |
At $50-500 per error (healthcare, finance, legal), that accuracy gap is worth $85K-$850K per day.
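The arithmetic behind that range is simple enough to sketch. This is a toy calculation using the article's illustrative assumptions (10,000 decisions/day, $50-500 per error), not measured data:

```python
# Illustrative economics from the comparison above. All constants are the
# article's assumptions, not measurements.
DECISIONS_PER_DAY = 10_000
ERROR_COST_LOW, ERROR_COST_HIGH = 50, 500  # $ per error in regulated domains

def daily_picture(accuracy: float) -> dict:
    """Automated volume and residual errors at a given accuracy."""
    automated = int(DECISIONS_PER_DAY * accuracy)
    errors = DECISIONS_PER_DAY - automated
    return {"automated": automated, "errors": errors}

super_agent = daily_picture(0.80)  # 8,000 automated, 2,000 errors
mini_agent = daily_picture(0.97)   # 9,700 automated, 300 errors

# The accuracy gap, priced per error:
delta_errors = super_agent["errors"] - mini_agent["errors"]  # 1,700
print(delta_errors * ERROR_COST_LOW, delta_errors * ERROR_COST_HIGH)  # 85000 850000
```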
The super agent makes a better story. The mini agent makes a better business.
What the Leaders are Saying (But Nobody's Connecting)
The pieces of this puzzle are scattered across blog posts and tweets from the top people in AI. Nobody's assembled them into a coherent picture for production systems.
Andrej Karpathy framed the macro shift in his seminal "Software 2.0" essay:
"The process of training the neural network compiles the dataset into the binary. In most practical applications today, the neural net architectures and the training systems are increasingly standardized into a commodity, so most of the active 'software development' takes the form of curating, growing, massaging and cleaning labeled datasets."
This was written about neural nets, but it applies perfectly to prompt-based mini-agents. The prompt is increasingly a commodity. The labeled data you evaluate against is the real IP.
Anthropic, in their influential "Building Effective Agents" guide, made a point that most readers glossed over:
"When building applications with LLMs, we recommend finding the simplest solution possible, and only increasing complexity when needed. This might mean not building agentic systems at all... For many applications, optimizing single LLM calls with retrieval and in-context examples is usually enough."
Read that again. Anthropic, the company building Claude, is telling you that most of the time you don't need an agent at all. A single well-crafted LLM call with good retrieval is enough. That's exactly the mini-agent pattern.
Hamel Husain, who built the precursor to GitHub Copilot (CodeSearchNet) and now consults for AI companies, puts it bluntly:
"I've found that unsuccessful products almost always share a common root cause: a failure to create robust evaluation systems... If you streamline your evaluation process, all other activities become easy."
He's been saying this for years. Eval is the bottleneck, not the model. Not the prompt. The eval.
Jason Wei (OpenAI, chain-of-thought prompting) on what makes evals actually work:
"It's critical to have a single-number metric. I can't think of any great evals that don't have a single-number metric."
And on the most common failure mode:
"Most of the non-successful evals make at least one mistake... If an eval is too complicated, it will be hard for people to understand it and it will simply be used less."
Simple metric. Simple eval. Run it a lot. This is the whole recipe.
Eugene Yan (Amazon, formerly Apple) on the eval hierarchy:
"IMHO, accuracy is too coarse a metric to be useful. We'd need to separate it into recall and precision at minimum, ideally across thresholds."
He's right, and it goes further than he suggests. For business decisions, you need to weight by dollars, not just by count. More on this below.
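Yan's point is easy to see with a small sketch: the same predictions, scored at two confidence thresholds, trade precision against recall in a way a single accuracy number hides. The scores and labels below are hypothetical:

```python
def precision_recall(scores, labels, threshold):
    """scores: model confidences; labels: ground-truth booleans.
    A case is flagged as positive when its score clears the threshold."""
    flagged = [s >= threshold for s in scores]
    tp = sum(f and l for f, l in zip(flagged, labels))
    fp = sum(f and not l for f, l in zip(flagged, labels))
    fn = sum((not f) and l for f, l in zip(flagged, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical data: sweep thresholds instead of reporting one number.
scores = [0.95, 0.90, 0.85, 0.70, 0.60, 0.40]
labels = [True, True, False, True, False, False]
for t in (0.5, 0.8):
    print(t, precision_recall(scores, labels, t))
```

Raising the threshold from 0.5 to 0.8 here improves precision (0.60 to 0.67) while recall falls (1.0 to 0.67). That trade-off is invisible in a single accuracy figure.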
Finding 1: The Model is Already Smart Enough
Biggest surprise from our experiments. We tested the same task across models ranging from GPT-5-mini ($0.75/1M input) to o1 ($45/1M).
The accuracy gap between cheapest and most expensive: about 2 percentage points. For a 60x cost multiplier.
The model was never the bottleneck.
Finding 2: Context Beats Instructions
We started with rule-heavy prompts. Thousands of tokens of decision trees. Ceiling: 84.9%.
Then we flipped it: short prompt, rich data. Historical outcomes, statistical pay rates, similar past cases.
Same 500-token prompt. Only the input data changed. 84.9% to 97.3%.
This confirms what Karpathy said about Software 2.0: the data is the program. Our prompt (the "architecture") stayed the same. The "training data" (context window) did all the work.
Finding 3: Long Prompts Overfit
Held across 300+ experiments.
Long prompts with extensive decision trees scored great on training data, terribly on new data. Best-performing architecture: ~500 tokens of instructions, then all data.
Hamel Husain saw the exact same pattern with his clients:
"Prompts expanded into long and unwieldy forms, attempting to cover numerous edge cases and examples."
He describes this as one of the key symptoms of an AI product that's hit a plateau. The solution isn't a longer prompt. It's a better eval loop that tells you what's actually working.
The Harness
So if the model is smart enough and the prompt is short... what's the real engineering problem?
It's the iteration system. The harness.
The core loop:
- Generate prompt variants (manual or meta-prompting)
- Evaluate each against labeled ground truth
- Score with a metric tied to business value
- Analyze failure patterns of top performers
- Evolve new variants targeting those specific failures
- Repeat
No multi-agent orchestration. No RAG. No chain-of-thought scaffolding. It's a for loop with an API call and a scoring function.
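A minimal sketch of that loop, assuming `call_llm(prompt, x)` is your model API and `score(preds, labels)` is your business metric; the `mutate` step is a placeholder for whatever variant-generation strategy you use (manual edits or meta-prompting):

```python
def mutate(prompt):
    # Placeholder: in practice, a meta-prompt that rewrites `prompt`
    # to target the failure patterns found in analysis.
    return prompt + " (variant)"

def run_harness(seed_prompts, examples, call_llm, score, generations=10):
    """Minimal harness: evaluate every variant against labeled ground truth,
    keep the champions, evolve new variants, repeat."""
    population = list(seed_prompts)
    for _ in range(generations):
        results = []
        for prompt in population:
            preds = [call_llm(prompt, x) for x, _ in examples]
            labels = [y for _, y in examples]
            results.append((score(preds, labels), prompt))
        results.sort(reverse=True)                      # best score first
        champions = [p for _, p in results[:3]]
        population = champions + [mutate(p) for p in champions]
    return results[0]                                   # (score, prompt)
```

The whole thing is a for loop, an API call, and a scoring function, exactly as described.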
We ran 330,000+ evaluations across 300+ prompt variants. The harness doesn't get tired, doesn't have confirmation bias, doesn't fall in love with its favorite prompt.
This maps directly to Anthropic's advice on building effective agents. They describe a "workflow" pattern (predefined code paths orchestrating LLM calls) as distinct from an "agent" pattern (LLM dynamically directing its own process). The harness is firmly in workflow territory. And Anthropic says:
"Workflows offer predictability and consistency for well-defined tasks."
Exactly. Mini-agents solve well-defined tasks. Predictability is the whole point.
The Two Hard Parts
1. Ground Truth Data
Thousands of labeled examples where humans already made the correct decision. Split into training, validation, and test sets.
Jason Wei's advice applies directly here:
"If an eval doesn't have enough examples, it will be noisy and a bad UI for researchers... It's good to have at least 1,000 examples for your eval."
For business decision tasks, I'd push that to 5,000+. You need enough to split three ways and still have statistical significance per segment.
If you're in a domain where historical decisions are recorded in a database (claims, underwriting, coding), every past decision is a free labeled example.
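Turning those recorded decisions into eval sets is mechanical. A sketch of the three-way split, with illustrative fractions and a fixed seed so the held-out test set never shifts between harness runs:

```python
import random

def three_way_split(examples, seed=42, val_frac=0.15, test_frac=0.15):
    """Split labeled historical decisions into train/validation/test.
    Fractions are illustrative; the fixed seed keeps the test set stable."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    return (
        shuffled[n_test + n_val:],        # train: evolve prompts here
        shuffled[n_test:n_test + n_val],  # validation: pick champions here
        shuffled[:n_test],                # test: touch once, at the end
    )

train, val, test = three_way_split(list(range(5000)))
# 5,000 examples -> 3,500 train / 750 validation / 750 test
```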
2. The Objective Function
Accuracy (% correct) treats all errors equally. In reality, they're not.
| Component | Weight | Measures |
|---|---|---|
| Dollar-weighted sensitivity | 60% | Of revenue that should be pursued, what % did we catch? |
| Dollar-weighted specificity | 20% | Of cases that should be closed, what % did we correctly close? |
| Unweighted accuracy | 20% | Standard correctness (sanity check) |
Jason Wei says a single-number metric is critical. He's right. But that number needs to reflect business reality, not just classification performance. Weight by dollars, not by count.
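The weighted metric in the table above can be collapsed into a single number. A sketch, assuming each case carries its dollar amount, the ground-truth decision, and the model's prediction (field names are illustrative):

```python
def composite_score(cases):
    """cases: dicts with 'dollars', 'should_pursue' (ground truth),
    and 'predicted_pursue'. Weights match the table: 60/20/20."""
    pursue = [c for c in cases if c["should_pursue"]]
    close = [c for c in cases if not c["should_pursue"]]

    # Dollar-weighted sensitivity: of revenue that should be pursued,
    # what fraction (by dollars) did we catch?
    pursue_dollars = sum(c["dollars"] for c in pursue) or 1
    caught = sum(c["dollars"] for c in pursue if c["predicted_pursue"])
    sensitivity = caught / pursue_dollars

    # Dollar-weighted specificity: of cases that should be closed,
    # what fraction (by dollars) did we correctly close?
    close_dollars = sum(c["dollars"] for c in close) or 1
    closed = sum(c["dollars"] for c in close if not c["predicted_pursue"])
    specificity = closed / close_dollars

    # Unweighted accuracy as a sanity check.
    accuracy = sum(c["predicted_pursue"] == c["should_pursue"] for c in cases) / len(cases)

    return 0.60 * sensitivity + 0.20 * specificity + 0.20 * accuracy
```

One number to rank variants by, but the number moves when dollars move, not just when counts do.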
What Evolution Looks Like
The interesting part is analyzing winner failures.
Example: our best prompt was great at everything except injectable drug claims with a specific denial code. The code was ambiguous: "wrong billing info" (resubmit) or "contractually excluded" (write off), depending on the drug.
The fix wasn't a better rule. It was better context: the historical pay rate for that drug+payer combination.
- Payer has never paid for this drug? Probably excluded. Write off.
- Payer usually pays but denied this one? Probably billing error. Resubmit.
Data disambiguated what the denial code couldn't. Each percentage point from 90% to 97% came from this kind of targeted evolution.
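As a sketch, the fix amounts to a routing rule keyed on historical context rather than on the denial code itself. The function name and the 5% threshold are illustrative, not the production values:

```python
def route_denied_claim(payer_drug_pay_rate: float, threshold: float = 0.05) -> str:
    """Disambiguate an ambiguous denial code using historical context.
    `payer_drug_pay_rate`: fraction of past claims for this drug+payer
    combination that were ultimately paid. Threshold is illustrative."""
    if payer_drug_pay_rate < threshold:
        return "write_off"   # payer essentially never pays: likely excluded
    return "resubmit"        # payer usually pays: likely a billing error
```

The rule is trivial; the value is in the pay-rate feature being present in the context window at all.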
Where This Generalizes
The pattern works for any decision that is:
- High volume (thousands/day)
- Currently human-made (automation headroom exists)
- Historically recorded (ground truth is available)
- Economically quantifiable (real objective function possible)
Candidates: loan underwriting, content moderation, insurance pricing, tax categorization, customer service routing, QA in manufacturing, fraud detection.
Each gets its own harness, its own ground truth, its own scoring function, its own tournament. The agents that emerge are disposable. The harness is the asset.
What Needs to Happen
Drawing from the research community's own recommendations:
Open-source eval harness frameworks. Plug in labeled data + scoring function + prompt generator. Get a validated champion. LMSYS built Chatbot Arena for ranking general-purpose chatbots with Elo ratings. We need the equivalent for domain-specific production tasks.
Research on prompt evolution strategies. Random mutation vs. targeted evolution vs. genetic algorithms. Jason Wei notes that "evals are incentives for the research community, and breakthroughs are often closely linked to a huge performance jump on some eval." The same applies to prompt engineering. Build the eval, and the prompt engineering becomes tractable.
Domain-specific benchmark datasets. ML has ImageNet, GLUE, SuperGLUE, GSM8K. Mini-agents have nothing.
Cost-aware evaluation. 97% at $0.002/decision beats 98% at $0.10/decision. Every leaderboard should include cost per decision alongside accuracy.
The Bottom Line
The sexiest problem in AI is building an agent that can do anything.
The most valuable problem is building a system that produces agents that do one thing very, very well.
The agent is disposable. The harness is the moat.
Bo Romir builds applied AI decision systems. Previously ran 330K+ automated evaluations across 300+ prompt variants in healthcare revenue cycle management.
References:
- Karpathy, A. "Software 2.0" (2017)
- Anthropic. "Building Effective Agents" (2024)
- Husain, H. "Your AI Product Needs Evals" (2024)
- Wei, J. "Successful Language Model Evals" (2024)
- Yan, E. "Task-Specific LLM Evals that Do & Don't Work" (2024)
- Zheng, L. et al. "Chatbot Arena: Benchmarking LLMs in the Wild" (2023)