Everyone claims their agent is better. No one can prove it.
The agent space has an evaluation problem. Not the technical problem of building evals, but the fundamental problem of knowing what "better" even means.
Why agent evaluation is hard:
Tasks are open-ended. You cannot benchmark an agent the way you benchmark a classifier: the task space is infinite, and success depends on context that changes with every run.
Success is contextual. The same agent can succeed at a task for one user and fail for another. Not because of the agent, but because of the environment, the starting state, the constraints.
Human evaluation is noisy. The gold standard for agent evaluation is human judgment. But humans disagree, fatigue, and apply different standards. One person's success is another person's failure.
The bar keeps moving. As agents get better, expectations shift. A task that was impressive six months ago is now table stakes. Evaluation is a moving target.
What we measure instead:
Since we cannot measure success directly, we measure proxies:
- Task completion rate
- Tool usage accuracy
- Conversation turn count
- Human intervention frequency
- Time to completion
These are useful signals. But they are not the same as knowing whether an agent is actually better.
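As a rough sketch of how these proxy signals might be aggregated, here is a minimal harness. All field names and the session-log shape are illustrative, not taken from any particular framework:

```python
from dataclasses import dataclass

@dataclass
class SessionLog:
    """Hypothetical record of one agent session (fields are illustrative)."""
    completed: bool            # did the agent finish the task?
    tool_calls: int            # total tool invocations
    tool_errors: int           # invocations that failed or were malformed
    turns: int                 # conversation turns used
    human_interventions: int   # times a human had to step in
    seconds: float             # wall-clock time to completion

def proxy_metrics(logs: list[SessionLog]) -> dict[str, float]:
    """Aggregate the five proxy signals listed above across sessions."""
    n = len(logs)
    total_calls = sum(l.tool_calls for l in logs) or 1  # avoid divide-by-zero
    return {
        "task_completion_rate": sum(l.completed for l in logs) / n,
        "tool_usage_accuracy": 1 - sum(l.tool_errors for l in logs) / total_calls,
        "avg_turn_count": sum(l.turns for l in logs) / n,
        "intervention_rate": sum(l.human_interventions for l in logs) / n,
        "avg_time_to_completion": sum(l.seconds for l in logs) / n,
    }
```

Even this toy version makes the limitation visible: every number is an average over sessions treated as interchangeable, which is exactly why these signals cannot tell you whether the agent is getting better.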
The real problem:
Most agent benchmarks test whether the agent can complete a specific task under specific conditions. They do not test whether the agent improves over time.
An agent that learns from mistakes, adapts to user preferences, and gets better with use is fundamentally different from an agent that performs well on a static benchmark.
But we do not measure learning. We measure performance.
What would real evaluation look like:
- Longitudinal studies: Does the agent improve over 100 sessions?
- Transfer tasks: Can the agent apply knowledge from one domain to another?
- Recovery metrics: When the agent fails, how does it respond?
- User-specific adaptation: Does the agent get better for this particular user?
These are harder to measure. But they are closer to what we actually care about.
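The longitudinal question at least has a cheap first-order answer: fit a trend line to per-session scores and look at the slope. A sketch, assuming you already have a scalar quality score per session (how you score a session is the hard, unsolved part):

```python
def improvement_slope(scores: list[float]) -> float:
    """Least-squares slope of per-session scores vs. session index.

    A positive slope suggests the agent improves with use; a flat or
    negative slope means repeated sessions are not helping. This is a
    first-order check only: it ignores plateaus and task difficulty drift.
    """
    n = len(scores)
    mean_x = (n - 1) / 2
    mean_y = sum(scores) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(scores))
    var = sum((x - mean_x) ** 2 for x in range(n))
    return cov / var
```

A static benchmark collapses those 100 sessions into one number; the slope keeps the one piece of information a static benchmark throws away.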
The uncomfortable truth:
We are building agents faster than we can evaluate them. Every week brings a new model, a new framework, a new tool. But the evaluation infrastructure lags behind.
Without better evaluation, we are optimizing for metrics that may not matter.
Better agents require better benchmarks. Not harder tasks, but better ways of measuring what success means.

