Evaluating Tool-Using Language Agents: Judge Reliability, Propagation Cascades, and Runtime Mitigation in AgentProp-Bench

arXiv cs.AI · April 21, 2026


Key Points

  • The paper presents AgentProp-Bench, a 2,000-task benchmark (2,300 traces) for evaluating tool-using LLM agents, including a human-validated 100-label subset to test assumptions about evaluation reliability.
  • It finds that simple substring-based judging is effectively chance-level compared with human annotation (kappa=0.049), while a three-LLM ensemble judge improves agreement to moderate reliability (kappa=0.432) with a conservative bias.
  • The study quantifies error propagation, showing that a parameter-level injection can lead to an incorrect final answer with a human-calibrated probability of about 0.62 (range 0.46–0.73 across models).
  • Rejection (detecting bad parameters) and recovery (correcting after acceptance) are largely independent capabilities across models, as indicated by a near-zero, statistically non-significant correlation (Spearman rho=0.126, p=0.747).
  • A tuned runtime interceptor reduces hallucination for GPT-4o-mini by 23.0 percentage points, but it has no significant effect for Gemini-2.0-Flash because its aggressive parameter rejection already prevents the targeted failure mode.
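
The judge-reliability figures above are Cohen's kappa: chance-corrected agreement between an automated judge's verdicts and human labels, where 0 is chance-level and 1 is perfect agreement. A minimal stdlib sketch over hypothetical binary verdicts (illustrative labels only, not the paper's data or judging pipeline):

```python
def cohens_kappa(a: list[int], b: list[int]) -> float:
    """Chance-corrected agreement between two binary label sequences."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    # Expected agreement if the two raters labeled independently.
    pa1, pb1 = sum(a) / n, sum(b) / n
    p_e = pa1 * pb1 + (1 - pa1) * (1 - pb1)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical: human labels vs. an automated judge's verdicts.
human = [1, 1, 0, 1, 0, 0, 1, 0]
judge = [1, 0, 0, 1, 1, 0, 0, 0]
print(round(cohens_kappa(human, judge), 3))  # → 0.25
```

On this toy sample the judge agrees with the human 62.5% of the time, yet kappa is only 0.25, because much of that raw agreement is expected by chance; this is why the paper reports kappa rather than plain accuracy when comparing substring and ensemble judges.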

Abstract

Automated evaluation of tool-using large language model (LLM) agents is widely assumed to be reliable, but this assumption has rarely been validated against human annotation. We introduce AgentProp-Bench, a 2,000-task benchmark with 2,300 traces across four domains, nine production LLMs, and a 100-label human-validated subset. We quantify judge reliability, characterize error propagation, and evaluate a runtime mitigation. Substring-based judging agrees with human annotation at kappa=0.049 (chance-level); a three-LLM ensemble reaches kappa=0.432 (moderate) with a conservative bias. Under validated evaluation, a parameter-level injection propagates to a wrong final answer with human-calibrated probability approximately 0.62 (range 0.46-0.73 across models). Rejection (catching bad parameters) and recovery (correcting after acceptance) are independent model capabilities (Spearman rho=0.126, p=0.747). A tuned runtime interceptor reduces hallucination on GPT-4o-mini by 23.0 percentage points under a concurrent n=600 control, but shows no significant effect on Gemini-2.0-Flash, whose aggressive parameter rejection eliminates the target failure mode. All code, data, traces, and human labels are released at https://github.com/bhaskargurram-ai/agenthallu-bench.
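
The paper does not detail its interceptor's implementation, but the general mechanism of a runtime parameter interceptor can be sketched as a wrapper that validates tool arguments before execution and returns a structured rejection the agent can observe and re-plan from. Everything below (function names, the validator, the return schema) is a hypothetical illustration, not the released code:

```python
from typing import Any, Callable

def make_interceptor(
    validate: Callable[[str, dict], bool],
    tool: Callable[..., Any],
    name: str,
) -> Callable[..., dict]:
    """Wrap a tool so calls with invalid parameters are rejected before
    execution, instead of letting a bad argument propagate into the
    agent's downstream reasoning."""
    def intercepted(**params) -> dict:
        if not validate(name, params):
            # Surface a structured rejection the agent can react to.
            return {"status": "rejected", "reason": "parameter validation failed"}
        return {"status": "ok", "result": tool(**params)}
    return intercepted

# Hypothetical tool and validator for illustration.
def get_weather(city: str) -> str:
    return f"sunny in {city}"

def validate(name: str, params: dict) -> bool:
    # Reject obviously malformed values, e.g. an empty or non-string city.
    city = params.get("city")
    return isinstance(city, str) and 0 < len(city) < 64

safe_weather = make_interceptor(validate, get_weather, "get_weather")
print(safe_weather(city="Paris"))  # accepted, tool runs
print(safe_weather(city=""))       # rejected before execution
```

This framing also makes the paper's Gemini-2.0-Flash result intuitive: if a model already rejects suspicious parameters on its own, an external interceptor targeting the same failure mode has little left to prevent.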