Evaluating Tool-Using Language Agents: Judge Reliability, Propagation Cascades, and Runtime Mitigation in AgentProp-Bench

arXiv cs.AI · April 21, 2026


Key Points

  • The paper presents AgentProp-Bench, a 2,000-task benchmark (2,300 traces) for evaluating tool-using LLM agents, including a human-validated 100-label subset to test assumptions about evaluation reliability.
  • It finds that simple substring-based judging is effectively chance-level compared with human annotation (kappa=0.049), while a three-LLM ensemble judge improves agreement to moderate reliability (kappa=0.432) with a conservative bias.
  • The study quantifies error propagation, showing that a parameter-level injection can lead to an incorrect final answer with a human-calibrated probability of about 0.62 (range 0.46–0.73 across models).
  • Rejection (detecting bad parameters) and recovery (correcting after acceptance) are largely independent capabilities across models, as indicated by a near-zero, statistically non-significant correlation (Spearman rho=0.126, p=0.747).
  • A tuned runtime interceptor reduces hallucination for GPT-4o-mini by 23.0 percentage points, but it has no significant effect for Gemini-2.0-Flash because its aggressive parameter rejection already prevents the targeted failure mode.
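
The judge-reliability figures above are Cohen's kappa: chance-corrected agreement between an automated judge's verdicts and human labels, where 0 is chance-level and 1 is perfect agreement. A minimal stdlib sketch over hypothetical binary verdicts (illustrative labels only, not the paper's data or judging pipeline):

```python
def cohens_kappa(a: list[int], b: list[int]) -> float:
    """Chance-corrected agreement between two binary label sequences."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    # Expected agreement if the two raters labeled independently.
    pa1, pb1 = sum(a) / n, sum(b) / n
    p_e = pa1 * pb1 + (1 - pa1) * (1 - pb1)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical: human labels vs. an automated judge's verdicts.
human = [1, 1, 0, 1, 0, 0, 1, 0]
judge = [1, 0, 0, 1, 1, 0, 0, 0]
print(round(cohens_kappa(human, judge), 3))  # → 0.25
```

On this toy sample the judge agrees with the human 62.5% of the time, yet kappa is only 0.25, because much of that raw agreement is expected by chance; this is why the paper reports kappa rather than plain accuracy when comparing substring and ensemble judges.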

Abstract

Automated evaluation of tool-using large language model (LLM) agents is widely assumed to be reliable, but this assumption has rarely been validated against human annotation. We introduce AgentProp-Bench, a 2,000-task benchmark with 2,300 traces across four domains, nine production LLMs, and a 100-label human-validated subset. We quantify judge reliability, characterize error propagation, and evaluate a runtime mitigation. Substring-based judging agrees with human annotation at kappa=0.049 (chance-level); a three-LLM ensemble reaches kappa=0.432 (moderate) with a conservative bias. Under validated evaluation, a parameter-level injection propagates to a wrong final answer with human-calibrated probability approximately 0.62 (range 0.46-0.73 across models). Rejection (catching bad parameters) and recovery (correcting after acceptance) are independent model capabilities (Spearman rho=0.126, p=0.747). A tuned runtime interceptor reduces hallucination on GPT-4o-mini by 23.0 percentage points under a concurrent n=600 control, but shows no significant effect on Gemini-2.0-Flash, whose aggressive parameter rejection eliminates the target failure mode. All code, data, traces, and human labels are released at https://github.com/bhaskargurram-ai/agenthallu-bench.
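
The paper does not detail its interceptor's implementation, but the general mechanism of a runtime parameter interceptor can be sketched as a wrapper that validates tool arguments before execution and returns a structured rejection the agent can observe and re-plan from. Everything below (function names, the validator, the return schema) is a hypothetical illustration, not the released code:

```python
from typing import Any, Callable

def make_interceptor(
    validate: Callable[[str, dict], bool],
    tool: Callable[..., Any],
    name: str,
) -> Callable[..., dict]:
    """Wrap a tool so calls with invalid parameters are rejected before
    execution, instead of letting a bad argument propagate into the
    agent's downstream reasoning."""
    def intercepted(**params) -> dict:
        if not validate(name, params):
            # Surface a structured rejection the agent can react to.
            return {"status": "rejected", "reason": "parameter validation failed"}
        return {"status": "ok", "result": tool(**params)}
    return intercepted

# Hypothetical tool and validator for illustration.
def get_weather(city: str) -> str:
    return f"sunny in {city}"

def validate(name: str, params: dict) -> bool:
    # Reject obviously malformed values, e.g. an empty or non-string city.
    city = params.get("city")
    return isinstance(city, str) and 0 < len(city) < 64

safe_weather = make_interceptor(validate, get_weather, "get_weather")
print(safe_weather(city="Paris"))  # accepted, tool runs
print(safe_weather(city=""))       # rejected before execution
```

This framing also makes the paper's Gemini-2.0-Flash result intuitive: if a model already rejects suspicious parameters on its own, an external interceptor targeting the same failure mode has little left to prevent.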