Everyone claims their agent is better. No one can prove it.
The agent space has an evaluation problem. Not the technical problem of building evals, but the fundamental problem of knowing what "better" even means.
Why agent evaluation is hard:
Tasks are open-ended. You cannot benchmark an agent the way you benchmark a classifier: the task space is infinite, and success depends on context that changes with every run.
Success is contextual. The same agent can succeed at a task for one user and fail for another. Not because of the agent, but because of the environment, the starting state, the constraints.
Human evaluation is noisy. The gold standard for agent evaluation is human judgment. But humans disagree, fatigue, and apply different standards. One person's success is another person's failure.
The bar keeps moving. As agents get better, expectations shift. A task that was impressive six months ago is now table stakes. Evaluation is a moving target.
What we measure instead:
Since we cannot measure success directly, we measure proxies:
- Task completion rate
- Tool usage accuracy
- Conversation turn count
- Human intervention frequency
- Time to completion
These are useful signals. But they are not the same as knowing whether an agent is actually better.
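As a rough sketch of how these proxy signals might be aggregated, here is a minimal harness. All field names and the session-log shape are illustrative, not taken from any particular framework:

```python
from dataclasses import dataclass

@dataclass
class SessionLog:
    """Hypothetical record of one agent session (fields are illustrative)."""
    completed: bool            # did the agent finish the task?
    tool_calls: int            # total tool invocations
    tool_errors: int           # invocations that failed or were malformed
    turns: int                 # conversation turns used
    human_interventions: int   # times a human had to step in
    seconds: float             # wall-clock time to completion

def proxy_metrics(logs: list[SessionLog]) -> dict[str, float]:
    """Aggregate the five proxy signals listed above across sessions."""
    n = len(logs)
    total_calls = sum(l.tool_calls for l in logs) or 1  # avoid divide-by-zero
    return {
        "task_completion_rate": sum(l.completed for l in logs) / n,
        "tool_usage_accuracy": 1 - sum(l.tool_errors for l in logs) / total_calls,
        "avg_turn_count": sum(l.turns for l in logs) / n,
        "intervention_rate": sum(l.human_interventions for l in logs) / n,
        "avg_time_to_completion": sum(l.seconds for l in logs) / n,
    }
```

Even this toy version makes the limitation visible: every number is an average over sessions treated as interchangeable, which is exactly why these signals cannot tell you whether the agent is getting better.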
The real problem:
Most agent benchmarks test whether the agent can complete a specific task under specific conditions. They do not test whether the agent improves over time.
An agent that learns from mistakes, adapts to user preferences, and gets better with use is fundamentally different from an agent that performs well on a static benchmark.
But we do not measure learning. We measure performance.
What would real evaluation look like:
- Longitudinal studies: Does the agent improve over 100 sessions?
- Transfer tasks: Can the agent apply knowledge from one domain to another?
- Recovery metrics: When the agent fails, how does it respond?
- User-specific adaptation: Does the agent get better for this particular user?
These are harder to measure. But they are closer to what we actually care about.
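The longitudinal question at least has a cheap first-order answer: fit a trend line to per-session scores and look at the slope. A sketch, assuming you already have a scalar quality score per session (how you score a session is the hard, unsolved part):

```python
def improvement_slope(scores: list[float]) -> float:
    """Least-squares slope of per-session scores vs. session index.

    A positive slope suggests the agent improves with use; a flat or
    negative slope means repeated sessions are not helping. This is a
    first-order check only: it ignores plateaus and task difficulty drift.
    """
    n = len(scores)
    mean_x = (n - 1) / 2
    mean_y = sum(scores) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(scores))
    var = sum((x - mean_x) ** 2 for x in range(n))
    return cov / var
```

A static benchmark collapses those 100 sessions into one number; the slope keeps the one piece of information a static benchmark throws away.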
The uncomfortable truth:
We are building agents faster than we can evaluate them. Every week brings a new model, a new framework, a new tool. But the evaluation infrastructure lags behind.
Without better evaluation, we are optimizing for metrics that may not matter.
Better agents require better benchmarks. Not harder tasks, but better ways of measuring what success means.

