The Long-Horizon Task Mirage? Diagnosing Where and Why Agentic Systems Break

arXiv cs.AI / 4/15/2026


Key Points

  • The paper finds that LLM-based agentic systems reliably handle short- to mid-horizon tasks but commonly fail on long-horizon tasks requiring extended, interdependent action sequences.
  • It introduces HORIZON, a cross-domain diagnostic benchmark designed to systematically construct long-horizon tasks and measure where and how agent failures degrade with horizon length.
  • Using HORIZON, the authors evaluate state-of-the-art agents (GPT-5 variants and Claude models) and collect 3,100+ trajectories across four agentic domains to characterize horizon-dependent failure patterns.
  • They propose a trajectory-grounded “LLM-as-a-Judge” pipeline to attribute failures in a scalable and reproducible way, validated against human annotations with substantial agreement (inter-annotator κ = 0.61; human–judge κ = 0.84).
  • The authors release a HORIZON Leaderboard and invite community contributions to support ongoing, principled comparison and diagnosis of long-horizon agent behavior.

Abstract

Large language model (LLM) agents perform strongly on short- and mid-horizon tasks, but often break down on long-horizon tasks that require extended, interdependent action sequences. Despite rapid progress in agentic systems, these long-horizon failures remain poorly characterized, hindering principled diagnosis and comparison across domains. To address this gap, we introduce HORIZON, an initial cross-domain diagnostic benchmark for systematically constructing tasks and analyzing long-horizon failure behaviors in LLM-based agents. Using HORIZON, we evaluate state-of-the-art (SOTA) agents from multiple model families (GPT-5 variants and Claude models), collecting 3,100+ trajectories across four representative agentic domains to study horizon-dependent degradation patterns. We further propose a trajectory-grounded LLM-as-a-Judge pipeline for scalable and reproducible failure attribution, and validate it with human annotation on trajectories, achieving strong agreement (inter-annotator κ = 0.61; human–judge κ = 0.84). Our findings take an initial methodological step toward systematic, cross-domain analysis of long-horizon agent failures and offer practical guidance for building more reliable long-horizon agents. We release our project website at the HORIZON Leaderboard (https://xwang2775.github.io/horizon-leaderboard/) and welcome contributions from the community.
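For readers unfamiliar with the agreement statistic behind the reported κ = 0.61 and κ = 0.84: the paper does not spell out its computation, but Cohen's kappa is the standard measure of agreement between two annotators corrected for chance. A minimal sketch (the failure-category labels below are hypothetical, not taken from the paper):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is the agreement expected by chance from each annotator's
    marginal label frequencies.
    """
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    # Observed agreement: fraction of items where both annotators agree.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of marginal label probabilities, summed.
    ca, cb = Counter(a), Counter(b)
    p_e = sum((ca[lbl] / n) * (cb[lbl] / n) for lbl in set(a) | set(b))
    return (p_o - p_e) / (1 - p_e)

# Hypothetical failure-attribution labels from a human and an LLM judge.
human = ["planning", "tool_error", "planning", "memory", "tool_error"]
judge = ["planning", "tool_error", "memory", "memory", "tool_error"]
print(round(cohens_kappa(human, judge), 3))
```

Values above roughly 0.6 are conventionally read as "substantial" agreement, which is why the paper's human–judge κ = 0.84 supports using the LLM judge in place of human annotation at scale.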