Evaluating Agentic AI in the Wild: Failure Modes, Drift Patterns, and a Production Evaluation Framework

arXiv cs.AI / 5/5/2026


Key Points

  • Existing LLM and agent evaluation benchmarks (e.g., HELM, MT-Bench, AgentBench, BIG-bench) assume controlled, single-session, lab-scale settings and do not cover production realities: compounding decision errors, cascading tool failures, non-deterministic output drift, and the absence of ground truth for long-horizon tasks (see the drift-monitoring sketch after this list).
  • The paper introduces a taxonomy of seven production-specific failure modes for agentic AI systems, based on observations from deployments at billion-event scale.
  • It shows empirically that standard metrics (ROUGE, BERTScore, accuracy/AUC) and the agentic benchmarks above fail to detect these failures: four of the seven modes go entirely undetected, and the remaining three are caught only after a lag of multiple evaluation cycles.
  • To address these gaps, the authors propose PAEF (Production Agentic Evaluation Framework), a five-dimension framework with an open-source reference implementation aimed at continuous evaluation on live production traffic rather than episodic benchmark runs.
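The summary does not say how PAEF computes its dimensions, but the "drift without ground truth" problem can be made concrete. The sketch below flags non-deterministic output drift by comparing a rolling window of a cheap per-response scalar (output length, tool-call count, a judge score) against a frozen reference window using the population stability index (PSI), a standard distribution-shift statistic. The metric choice, window size, and 0.2 threshold are illustrative assumptions, not the paper's method.

```python
import math
from collections import deque

def psi(reference, live, bins=10):
    """Population Stability Index between two scalar samples.

    Values above ~0.2 are conventionally read as significant drift.
    Bins come from the reference window so the comparison stays
    anchored to known-good behavior.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # guard against a degenerate reference

    def hist(xs):
        counts = [0] * bins
        for x in xs:
            i = int((x - lo) / width)
            counts[min(max(i, 0), bins - 1)] += 1
        # Laplace smoothing so empty bins don't blow up the log ratio.
        return [(c + 1) / (len(xs) + bins) for c in counts]

    p, q = hist(reference), hist(live)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

class DriftMonitor:
    """Compares a rolling live window against a frozen reference window."""

    def __init__(self, reference, window=500, threshold=0.2):
        self.reference = list(reference)
        self.live = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, value):
        """Feed one per-response scalar; returns True once drift is flagged."""
        self.live.append(value)
        full = len(self.live) == self.live.maxlen
        return full and psi(self.reference, self.live) > self.threshold
```

A monitor of this shape surfaces drift within one window of live traffic, whereas an episodic benchmark re-run would, at best, catch it at the next evaluation cycle; that gap is the detection lag the paper measures.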

Abstract

Existing evaluation frameworks for large language models -- including HELM, MT-Bench, AgentBench, and BIG-bench -- are designed for controlled, single-session, lab-scale settings. They do not address the evaluation challenges that emerge when agentic AI systems operate continuously in production: compounding decision errors, tool failure cascades, non-deterministic output drift, and the absence of ground truth for long-horizon tasks. This paper makes three contributions. First, we present a taxonomy of seven failure modes unique to production agentic systems, each grounded in observations from systems operating at billion-event scale. Second, we demonstrate empirically where standard metrics -- ROUGE, BERTScore, accuracy/AUC, and the agentic benchmarks above -- fail to detect each failure mode. Third, we propose PAEF (Production Agentic Evaluation Framework), a five-dimension evaluation framework with an open-source reference implementation, designed for continuous evaluation on production traffic rather than episodic benchmark runs. Our analysis shows that standard metrics fail to detect four of the seven failure modes entirely and detect three others only after a lag of multiple evaluation cycles.
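To make "continuous evaluation on production traffic rather than episodic benchmark runs" concrete, here is a minimal sketch of one evaluation cycle that samples live traces and aggregates per-dimension scores. The `Trace` fields, the two scorers, and the 10% sample rate are hypothetical placeholders; the abstract does not enumerate PAEF's five dimensions or the API of its reference implementation.

```python
import random
import statistics
from dataclasses import dataclass
from typing import Callable, Dict, Iterable

@dataclass
class Trace:
    """A minimal stand-in for one logged agent episode."""
    steps: int
    tool_errors: int
    resolved: bool

# Placeholder scorers: names and formulas are purely illustrative,
# not PAEF's actual dimensions.
def tool_reliability(t: Trace) -> float:
    return 1.0 - t.tool_errors / max(t.steps, 1)

def task_resolution(t: Trace) -> float:
    return 1.0 if t.resolved else 0.0

SCORERS: Dict[str, Callable[[Trace], float]] = {
    "tool_reliability": tool_reliability,
    "task_resolution": task_resolution,
}

def evaluation_cycle(traces: Iterable[Trace],
                     sample_rate: float = 0.1) -> Dict[str, float]:
    """Score a random sample of live traces; aggregate per dimension."""
    sample = [t for t in traces if random.random() < sample_rate]
    if not sample:
        return {name: float("nan") for name in SCORERS}
    return {name: statistics.mean(f(t) for t in sample)
            for name, f in SCORERS.items()}
```

Run on a schedule or over a stream, each cycle emits a score vector that can be trended over time, which is what would let compounding errors and tool-failure cascades surface between benchmark releases instead of at them.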