On building mini-agents that actually work in production, and why nobody's talking about the real problem.
There's a split happening in AI that most people are missing.
Camp A builds "super agents." Autonomous systems that browse the web, write code, use dozens of tools, plan multi-step workflows. Devin, Manus, Operator. Gets the tweets, the funding, the demos.
Camp B builds "mini-agents." Narrow, single-task decision systems optimized for accuracy at volume. No tools, no browsing, no multi-step planning. One question, answered correctly, tens of thousands of times per day.
Camp B is almost entirely silent. I've spent months in Camp B, and I think the harness that produces these agents matters more than any individual agent.
The Economics
Consider a business process with 10,000 binary decisions per day. Each takes ~5 minutes of skilled labor. That's 104 full-time employees.
| | Super Agent (80%) | Mini Agent (97%) |
|---|---|---|
| Automated | 8,000/day | 9,700/day |
| Human review | ~2,000/day (slower to fix) | ~300/day (cleanly flagged) |
| Headcount | ~50 + 5 eng | ~6 |
| Errors/day | 2,000 | 300 |
| Cost/decision | ~$0.05 | ~$0.002 |
At $50-500 per error (healthcare, finance, legal), that accuracy gap is worth $85K-$850K per day.
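The arithmetic behind that range is simple enough to sketch. This is a toy calculation using the article's illustrative assumptions (10,000 decisions/day, $50-500 per error), not measured data:

```python
# Illustrative economics from the comparison above. All constants are the
# article's assumptions, not measurements.
DECISIONS_PER_DAY = 10_000
ERROR_COST_LOW, ERROR_COST_HIGH = 50, 500  # $ per error in regulated domains

def daily_picture(accuracy: float) -> dict:
    """Automated volume and residual errors at a given accuracy."""
    automated = int(DECISIONS_PER_DAY * accuracy)
    errors = DECISIONS_PER_DAY - automated
    return {"automated": automated, "errors": errors}

super_agent = daily_picture(0.80)  # 8,000 automated, 2,000 errors
mini_agent = daily_picture(0.97)   # 9,700 automated, 300 errors

# The accuracy gap, priced per error:
delta_errors = super_agent["errors"] - mini_agent["errors"]  # 1,700
print(delta_errors * ERROR_COST_LOW, delta_errors * ERROR_COST_HIGH)  # 85000 850000
```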
The super agent makes a better story. The mini agent makes a better business.
What the Leaders are Saying (But Nobody's Connecting)
The pieces of this puzzle are scattered across blog posts and tweets from the top people in AI. Nobody's assembled them into a coherent picture for production systems.
Andrej Karpathy framed the macro shift in his seminal "Software 2.0" essay:
"The process of training the neural network compiles the dataset into the binary. In most practical applications today, the neural net architectures and the training systems are increasingly standardized into a commodity, so most of the active 'software development' takes the form of curating, growing, massaging and cleaning labeled datasets."
This was written about neural nets, but it applies perfectly to prompt-based mini-agents. The prompt is increasingly a commodity. The labeled data you evaluate against is the real IP.
Anthropic, in their influential "Building Effective Agents" guide, made a point that most readers glossed over:
"When building applications with LLMs, we recommend finding the simplest solution possible, and only increasing complexity when needed. This might mean not building agentic systems at all... For many applications, optimizing single LLM calls with retrieval and in-context examples is usually enough."
Read that again. Anthropic, the company building Claude, is telling you that most of the time you don't need an agent at all. A single well-crafted LLM call with good retrieval is enough. That's exactly the mini-agent pattern.
Hamel Husain, who built the precursor to GitHub Copilot (CodeSearchNet) and now consults for AI companies, puts it bluntly:
"I've found that unsuccessful products almost always share a common root cause: a failure to create robust evaluation systems... If you streamline your evaluation process, all other activities become easy."
He's been saying this for years. Eval is the bottleneck, not the model. Not the prompt. The eval.
Jason Wei (OpenAI, chain-of-thought prompting) on what makes evals actually work:
"It's critical to have a single-number metric. I can't think of any great evals that don't have a single-number metric."
And on the most common failure mode:
"Most of the non-successful evals make at least one mistake... If an eval is too complicated, it will be hard for people to understand it and it will simply be used less."
Simple metric. Simple eval. Run it a lot. This is the whole recipe.
Eugene Yan (Amazon, formerly Apple) on the eval hierarchy:
"IMHO, accuracy is too coarse a metric to be useful. We'd need to separate it into recall and precision at minimum, ideally across thresholds."
He's right, and it goes further than he suggests. For business decisions, you need to weight by dollars, not just by count. More on this below.
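Yan's point is easy to see with a small sketch: the same predictions, scored at two confidence thresholds, trade precision against recall in a way a single accuracy number hides. The scores and labels below are hypothetical:

```python
def precision_recall(scores, labels, threshold):
    """scores: model confidences; labels: ground-truth booleans.
    A case is flagged as positive when its score clears the threshold."""
    flagged = [s >= threshold for s in scores]
    tp = sum(f and l for f, l in zip(flagged, labels))
    fp = sum(f and not l for f, l in zip(flagged, labels))
    fn = sum((not f) and l for f, l in zip(flagged, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical data: sweep thresholds instead of reporting one number.
scores = [0.95, 0.90, 0.85, 0.70, 0.60, 0.40]
labels = [True, True, False, True, False, False]
for t in (0.5, 0.8):
    print(t, precision_recall(scores, labels, t))
```

Raising the threshold from 0.5 to 0.8 here improves precision (0.60 to 0.67) while recall falls (1.0 to 0.67). That trade-off is invisible in a single accuracy figure.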
Finding 1: The Model is Already Smart Enough
Biggest surprise from our experiments. We tested the same task across models ranging from GPT-5-mini ($0.75/1M input) to o1 ($45/1M).
The accuracy gap between cheapest and most expensive: about 2 percentage points. For a 60x cost multiplier.
The model was never the bottleneck.
Finding 2: Context Beats Instructions
We started with rule-heavy prompts. Thousands of tokens of decision trees. Ceiling: 84.9%.
Then we flipped it: short prompt, rich data. Historical outcomes, statistical pay rates, similar past cases.
Same 500-token prompt. Only the input data changed. 84.9% to 97.3%.
This confirms what Karpathy said about Software 2.0: the data is the program. Our prompt (the "architecture") stayed the same. The "training data" (context window) did all the work.
Finding 3: Long Prompts Overfit
Held across 300+ experiments.
Long prompts with extensive decision trees scored great on training data, terribly on new data. Best-performing architecture: ~500 tokens of instructions, then all data.
Hamel Husain saw the exact same pattern with his clients:
"Prompts expanded into long and unwieldy forms, attempting to cover numerous edge cases and examples."
He describes this as one of the key symptoms of an AI product that's hit a plateau. The solution isn't a longer prompt. It's a better eval loop that tells you what's actually working.
The Harness
So if the model is smart enough and the prompt is short... what's the real engineering problem?
It's the iteration system. The harness.
The core loop:
- Generate prompt variants (manual or meta-prompting)
- Evaluate each against labeled ground truth
- Score with a metric tied to business value
- Analyze failure patterns of top performers
- Evolve new variants targeting those specific failures
- Repeat
No multi-agent orchestration. No RAG. No chain-of-thought scaffolding. It's a for loop with an API call and a scoring function.
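A minimal sketch of that loop, assuming `call_llm(prompt, x)` is your model API and `score(preds, labels)` is your business metric; the `mutate` step is a placeholder for whatever variant-generation strategy you use (manual edits or meta-prompting):

```python
def mutate(prompt):
    # Placeholder: in practice, a meta-prompt that rewrites `prompt`
    # to target the failure patterns found in analysis.
    return prompt + " (variant)"

def run_harness(seed_prompts, examples, call_llm, score, generations=10):
    """Minimal harness: evaluate every variant against labeled ground truth,
    keep the champions, evolve new variants, repeat."""
    population = list(seed_prompts)
    for _ in range(generations):
        results = []
        for prompt in population:
            preds = [call_llm(prompt, x) for x, _ in examples]
            labels = [y for _, y in examples]
            results.append((score(preds, labels), prompt))
        results.sort(reverse=True)                      # best score first
        champions = [p for _, p in results[:3]]
        population = champions + [mutate(p) for p in champions]
    return results[0]                                   # (score, prompt)
```

The whole thing is a for loop, an API call, and a scoring function, exactly as described.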
We ran 330,000+ evaluations across 300+ prompt variants. The harness doesn't get tired, doesn't have confirmation bias, doesn't fall in love with its favorite prompt.
This maps directly to Anthropic's advice on building effective agents. They describe a "workflow" pattern (predefined code paths orchestrating LLM calls) as distinct from an "agent" pattern (LLM dynamically directing its own process). The harness is firmly in workflow territory. And Anthropic says:
"Workflows offer predictability and consistency for well-defined tasks."
Exactly. Mini-agents solve well-defined tasks. Predictability is the whole point.
The Two Hard Parts
1. Ground Truth Data
Thousands of labeled examples where humans already made the correct decision. Split into training, validation, and test sets.
Jason Wei's advice applies directly here:
"If an eval doesn't have enough examples, it will be noisy and a bad UI for researchers... It's good to have at least 1,000 examples for your eval."
For business decision tasks, I'd push that to 5,000+. You need enough to split three ways and still have statistical significance per segment.
If you're in a domain where historical decisions are recorded in a database (claims, underwriting, coding), every past decision is a free labeled example.
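Turning those recorded decisions into eval sets is mechanical. A sketch of the three-way split, with illustrative fractions and a fixed seed so the held-out test set never shifts between harness runs:

```python
import random

def three_way_split(examples, seed=42, val_frac=0.15, test_frac=0.15):
    """Split labeled historical decisions into train/validation/test.
    Fractions are illustrative; the fixed seed keeps the test set stable."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    return (
        shuffled[n_test + n_val:],        # train: evolve prompts here
        shuffled[n_test:n_test + n_val],  # validation: pick champions here
        shuffled[:n_test],                # test: touch once, at the end
    )

train, val, test = three_way_split(list(range(5000)))
# 5,000 examples -> 3,500 train / 750 validation / 750 test
```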
2. The Objective Function
Accuracy (% correct) treats all errors equally. In reality, they're not.
| Component | Weight | Measures |
|---|---|---|
| Dollar-weighted sensitivity | 60% | Of revenue that should be pursued, what % did we catch? |
| Dollar-weighted specificity | 20% | Of cases that should be closed, what % did we correctly close? |
| Unweighted accuracy | 20% | Standard correctness (sanity check) |
Jason Wei says a single-number metric is critical. He's right. But that number needs to reflect business reality, not just classification performance. Weight by dollars, not by count.
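The weighted metric in the table above can be collapsed into a single number. A sketch, assuming each case carries its dollar amount, the ground-truth decision, and the model's prediction (field names are illustrative):

```python
def composite_score(cases):
    """cases: dicts with 'dollars', 'should_pursue' (ground truth),
    and 'predicted_pursue'. Weights match the table: 60/20/20."""
    pursue = [c for c in cases if c["should_pursue"]]
    close = [c for c in cases if not c["should_pursue"]]

    # Dollar-weighted sensitivity: of revenue that should be pursued,
    # what fraction (by dollars) did we catch?
    pursue_dollars = sum(c["dollars"] for c in pursue) or 1
    caught = sum(c["dollars"] for c in pursue if c["predicted_pursue"])
    sensitivity = caught / pursue_dollars

    # Dollar-weighted specificity: of cases that should be closed,
    # what fraction (by dollars) did we correctly close?
    close_dollars = sum(c["dollars"] for c in close) or 1
    closed = sum(c["dollars"] for c in close if not c["predicted_pursue"])
    specificity = closed / close_dollars

    # Unweighted accuracy as a sanity check.
    accuracy = sum(c["predicted_pursue"] == c["should_pursue"] for c in cases) / len(cases)

    return 0.60 * sensitivity + 0.20 * specificity + 0.20 * accuracy
```

One number to rank variants by, but the number moves when dollars move, not just when counts do.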
What Evolution Looks Like
The interesting part is analyzing winner failures.
Example: our best prompt was great at everything except injectable drug claims with a specific denial code. The code was ambiguous: "wrong billing info" (resubmit) or "contractually excluded" (write off), depending on the drug.
The fix wasn't a better rule. It was better context: the historical pay rate for that drug+payer combination.
- Payer has never paid for this drug? Probably excluded. Write off.
- Payer usually pays but denied this one? Probably billing error. Resubmit.
Data disambiguated what the denial code couldn't. Each percentage point from 90% to 97% came from this kind of targeted evolution.
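As a sketch, the fix amounts to a routing rule keyed on historical context rather than on the denial code itself. The function name and the 5% threshold are illustrative, not the production values:

```python
def route_denied_claim(payer_drug_pay_rate: float, threshold: float = 0.05) -> str:
    """Disambiguate an ambiguous denial code using historical context.
    `payer_drug_pay_rate`: fraction of past claims for this drug+payer
    combination that were ultimately paid. Threshold is illustrative."""
    if payer_drug_pay_rate < threshold:
        return "write_off"   # payer essentially never pays: likely excluded
    return "resubmit"        # payer usually pays: likely a billing error
```

The rule is trivial; the value is in the pay-rate feature being present in the context window at all.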
Where This Generalizes
The pattern works for any decision that is:
- High volume (thousands/day)
- Currently human-made (automation headroom exists)
- Historically recorded (ground truth is available)
- Economically quantifiable (real objective function possible)
Candidates: loan underwriting, content moderation, insurance pricing, tax categorization, customer service routing, QA in manufacturing, fraud detection.
Each gets its own harness, its own ground truth, its own scoring function, its own tournament. The agents that emerge are disposable. The harness is the asset.
What Needs to Happen
Drawing from the research community's own recommendations:
Open-source eval harness frameworks. Plug in labeled data + scoring function + prompt generator. Get a validated champion. LMSYS built Chatbot Arena for ranking general-purpose chatbots with Elo ratings. We need the equivalent for domain-specific production tasks.
Research on prompt evolution strategies. Random mutation vs. targeted evolution vs. genetic algorithms. Jason Wei notes that "evals are incentives for the research community, and breakthroughs are often closely linked to a huge performance jump on some eval." The same applies to prompt engineering. Build the eval, and the prompt engineering becomes tractable.
Domain-specific benchmark datasets. ML has ImageNet, GLUE, SuperGLUE, GSM8K. Mini-agents have nothing.
Cost-aware evaluation. 97% at $0.002/decision beats 98% at $0.10/decision. Every leaderboard should include cost per decision alongside accuracy.
The Bottom Line
The sexiest problem in AI is building an agent that can do anything.
The most valuable problem is building a system that produces agents that do one thing very, very well.
The agent is disposable. The harness is the moat.
Bo Romir builds applied AI decision systems. Previously ran 330K+ automated evaluations across 300+ prompt variants in healthcare revenue cycle management.
References:
- Karpathy, A. "Software 2.0" (2017)
- Anthropic. "Building Effective Agents" (2024)
- Husain, H. "Your AI Product Needs Evals" (2024)
- Wei, J. "Successful Language Model Evals" (2024)
- Yan, E. "Task-Specific LLM Evals that Do & Don't Work" (2024)
- Zheng, L. et al. "Chatbot Arena: Benchmarking LLMs in the Wild" (2023)