Evaluating Tool-Using Language Agents: Judge Reliability, Propagation Cascades, and Runtime Mitigation in AgentProp-Bench
arXiv cs.AI / April 21, 2026
Key Points
- The paper presents AgentProp-Bench, a 2,000-task benchmark (2,300 traces) for evaluating tool-using LLM agents, including a human-validated 100-label subset to test assumptions about evaluation reliability.
- It finds that simple substring-based judging is effectively chance-level relative to human annotation (Cohen's kappa = 0.049), while a three-LLM ensemble judge reaches moderate agreement (kappa = 0.432), albeit with a conservative bias.
- The study quantifies error propagation, showing that a parameter-level injection can lead to an incorrect final answer with a human-calibrated probability of about 0.62 (range 0.46–0.73 across models).
- Rejection (detecting bad parameters before execution) and recovery (correcting course after a bad parameter is accepted) appear to be largely independent capabilities across models, with a low, non-significant correlation between them (Spearman rho = 0.126, p = 0.747).
- A tuned runtime interceptor reduces the hallucination rate for GPT-4o-mini by 23.0 percentage points, but has no significant effect for Gemini-2.0-Flash, whose aggressive parameter rejection already prevents the targeted failure mode.
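To make the judge-reliability comparison concrete, the sketch below implements a naive substring judge of the kind the paper finds near chance-level, plus a stdlib-only Cohen's kappa for scoring any judge's labels against human labels. This is an illustrative reconstruction, not the paper's code; the function names and toy labels are invented for the example.

```python
from collections import Counter

def substring_judge(answer: str, reference: str) -> bool:
    # Naive "judge": pass iff the reference string appears verbatim
    # in the agent's answer (case-insensitive). The paper reports this
    # style of judging agrees with humans at roughly chance level.
    return reference.lower() in answer.lower()

def cohens_kappa(labels_a, labels_b):
    # Cohen's kappa between two equal-length label sequences:
    # (observed agreement - chance agreement) / (1 - chance agreement).
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    chance = sum(counts_a[k] * counts_b[k]
                 for k in set(labels_a) | set(labels_b)) / (n * n)
    if chance == 1.0:          # both raters used a single label
        return 1.0
    return (observed - chance) / (1 - chance)

# Toy illustration with hypothetical labels (not the paper's data):
human = [1, 1, 0, 0]
judge = [1, 0, 1, 0]           # agrees half the time, same base rates
print(cohens_kappa(human, judge))   # 0.0 — chance-level agreement
print(cohens_kappa(human, human))   # 1.0 — perfect agreement
```

A kappa near 0 means the judge adds no information beyond label base rates, which is the failure the human-validated 100-label subset is designed to expose; the ensemble judge's 0.432 sits in the conventional "moderate agreement" band.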