Beyond pass@1: A Reliability Science Framework for Long-Horizon LLM Agents
arXiv cs.AI / 4/1/2026
Key Points
- The paper argues that conventional metrics such as pass@1 measure single-attempt capability but fail to capture what long-horizon LLM agents need: success that holds up across repeated attempts and varying task durations (see the first sketch after this list).
- It introduces a “reliability science” evaluation framework built on four new metrics: Reliability Decay Curve (RDC), Variance Amplification Factor (VAF), Graceful Degradation Score (GDS), and Meltdown Onset Point (MOP), which together quantify how reliability degrades as task horizons lengthen (see the second sketch after this list).
- Across 10 models and 23,392 episodes spanning 396 tasks, grouped by duration bucket and domain, the study finds that reliability decay is domain-stratified: some domains degrade sharply while others remain relatively stable.
- The authors report that capability and reliability rankings can diverge substantially at long horizons (including multi-rank inversions), and that “frontier” models show the highest meltdown rates, as their ambitious multi-step strategies can spiral into compounding failures.
- A further finding is that adding memory scaffolds worsens long-horizon performance for every evaluated model, motivating reliability as an evaluation dimension on par with raw capability.
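
To see why a single-attempt score can overstate dependability, consider the gap the first key point describes. A minimal numeric sketch, assuming independent attempts (an illustration only; the paper's own reliability measure may be defined differently):

```python
# If each independent attempt succeeds with probability p, a pass@1-style
# score reports p, but the probability of succeeding on every one of
# k repeated attempts compounds to p**k.
p, k = 0.9, 10
print(f"pass@1-style capability: {p}")          # 0.9
print(f"all {k} attempts succeed: {p**k:.3f}")  # 0.349
```

A model that looks 90% capable on a single attempt succeeds on all ten runs only about a third of the time, which is exactly the kind of gap reliability-oriented metrics are meant to surface.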
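The paper's metric definitions are not reproduced in this summary, so the sketch below shows one plausible reading of two of them: the Reliability Decay Curve as per-duration-bucket success rate, and the Meltdown Onset Point as the first bucket where that rate falls below a threshold. The bucket values, the threshold, and both formulas are assumptions for illustration, not the paper's exact definitions.

```python
from collections import defaultdict

# Hypothetical episode records: (duration_bucket_hours, success).
# Each task is attempted repeatedly; outcomes are aggregated per bucket.
episodes = [
    (0.5, True), (0.5, True), (0.5, False),
    (2.0, True), (2.0, False), (2.0, False),
    (8.0, False), (8.0, True), (8.0, False),
]

def reliability_decay_curve(episodes):
    """Assumed reading of RDC: mean success rate per duration bucket,
    ordered by increasing task duration."""
    buckets = defaultdict(list)
    for duration, success in episodes:
        buckets[duration].append(success)
    return sorted((d, sum(s) / len(s)) for d, s in buckets.items())

def meltdown_onset_point(curve, threshold=0.5):
    """Assumed reading of MOP: the shortest duration bucket at which
    reliability first drops below a chosen threshold."""
    for duration, rate in curve:
        if rate < threshold:
            return duration
    return None  # reliability never fell below the threshold

curve = reliability_decay_curve(episodes)
print(curve)                        # ~ [(0.5, 0.67), (2.0, 0.33), (8.0, 0.33)]
print(meltdown_onset_point(curve))  # 2.0
```

Under this reading, VAF and GDS would summarize the spread and shape of the same curve; consult the paper for the actual formulas.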