The Long-Horizon Task Mirage? Diagnosing Where and Why Agentic Systems Break
arXiv cs.AI / 4/15/2026
Key Points
- The paper finds that LLM-based agentic systems reliably handle short- to mid-horizon tasks but commonly fail on long-horizon tasks requiring extended, interdependent action sequences.
- It introduces HORIZON, a cross-domain diagnostic benchmark that systematically constructs long-horizon tasks and measures where and how agent performance degrades as horizon length grows.
- Using HORIZON, the authors evaluate state-of-the-art agents (GPT-5 variants and Claude models) and collect 3,100+ trajectories across four agentic domains to characterize horizon-dependent failure patterns.
- They propose a trajectory-grounded “LLM-as-a-Judge” pipeline to attribute failures in a scalable and reproducible way, validated against human annotations with substantial agreement (kappa values reported).
- The authors release a HORIZON Leaderboard and invite community contributions to support ongoing, principled comparison and diagnosis of long-horizon agent behavior.
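The judge-validation step above hinges on inter-annotator agreement. A minimal sketch of how judge-versus-human agreement could be scored with Cohen's kappa (the failure-category labels below are hypothetical, not taken from the paper):

```python
from collections import Counter

def cohen_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa: chance-corrected agreement between two raters."""
    assert len(labels_a) == len(labels_b) and labels_a, "need paired labels"
    n = len(labels_a)
    # Observed agreement: fraction of items both raters labeled identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence, from each rater's marginals.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical failure attributions for six trajectories:
judge = ["tool_error", "planning", "planning", "context_loss", "tool_error", "planning"]
human = ["tool_error", "planning", "context_loss", "context_loss", "tool_error", "planning"]
print(cohen_kappa(judge, human))  # → 0.75, "substantial" on the Landis-Koch scale
```

A kappa in the 0.61-0.80 band is conventionally read as substantial agreement, which is the threshold the paper's validation claim appeals to.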
