Beyond pass@1: A Reliability Science Framework for Long-Horizon LLM Agents

arXiv cs.AI / 4/1/2026

Key Points

  • The paper argues that conventional metrics such as pass@1 measure single-attempt capability but fail to capture the reliability that long-horizon LLM agents need, where success must be consistent across repeated attempts and tasks of varying duration.
  • It introduces a “reliability science” evaluation framework with four new metrics: Reliability Decay Curve (RDC), Variance Amplification Factor (VAF), Graceful Degradation Score (GDS), and Meltdown Onset Point (MOP), which quantify how reliability degrades as task duration grows (a rough sketch of how such metrics could be computed follows this list).
  • Across 10 models and 23,392 episodes on a 396-task benchmark spanning four duration buckets and three domains, the study finds reliability decay is domain-stratified: software-engineering GDS falls from 0.90 to 0.44, while document processing stays nearly flat (0.74 to 0.71).
  • The authors report that capability and reliability rankings can diverge substantially at long horizons (including multi-rank inversions), and that “frontier” models show the highest meltdown rates (up to 19%) because they attempt ambitious multi-step strategies that sometimes spiral.
  • A further finding is that adding memory scaffolds universally worsens long-horizon performance across all evaluated models, motivating reliability as an evaluation dimension on par with raw capability.
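
The summary does not spell out how the four metrics are computed. Below is a minimal sketch under simple assumed operationalizations (the formulas are illustrative assumptions, not the authors' definitions), deriving each metric from per-episode success records grouped by duration bucket:

```python
from collections import defaultdict
from statistics import mean, pvariance

# Hypothetical per-episode records: (model, duration_bucket, success).
# Bucket 0 holds the shortest tasks, bucket 3 the longest; the data is made up.
episodes = [
    ("model-a", 0, True), ("model-a", 0, True), ("model-a", 0, False),
    ("model-a", 3, True), ("model-a", 3, False), ("model-a", 3, False),
]

def success_by_bucket(records):
    """Group binary outcomes by duration bucket."""
    buckets = defaultdict(list)
    for _, bucket, ok in records:
        buckets[bucket].append(1.0 if ok else 0.0)
    return buckets

def reliability_decay_curve(records):
    """RDC (assumed form): mean success rate per duration bucket."""
    return {b: mean(v) for b, v in sorted(success_by_bucket(records).items())}

def variance_amplification_factor(records, short=0, long=3):
    """VAF (assumed form): outcome variance in the longest bucket
    divided by outcome variance in the shortest bucket."""
    buckets = success_by_bucket(records)
    v_short, v_long = pvariance(buckets[short]), pvariance(buckets[long])
    return v_long / v_short if v_short > 0 else float("inf")

def graceful_degradation_score(rdc):
    """GDS (assumed form): long-horizon success rate relative to
    short-horizon success rate; 1.0 means no degradation at all."""
    order = sorted(rdc)
    return rdc[order[-1]] / rdc[order[0]] if rdc[order[0]] > 0 else 0.0

def meltdown_onset_point(rdc, threshold=0.5):
    """MOP (assumed form): first duration bucket whose success rate
    drops below the threshold; None if reliability never collapses."""
    for bucket in sorted(rdc):
        if rdc[bucket] < threshold:
            return bucket
    return None

rdc = reliability_decay_curve(episodes)
print(rdc)                                      # success rate per bucket
print(variance_amplification_factor(episodes))  # 1.0 on this toy data
print(graceful_degradation_score(rdc))          # 0.5: long-horizon rate is half the short-horizon rate
print(meltdown_onset_point(rdc))                # 3: reliability first falls below 0.5 in the longest bucket
```

Under these assumed definitions, a GDS near 1.0 means reliability holds up across horizons, and a low MOP bucket flags an early collapse.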

Abstract

Existing benchmarks measure capability -- whether a model succeeds on a single attempt -- but production deployments require reliability -- consistent success across repeated attempts on tasks of varying duration. We show these properties diverge systematically as task duration grows, and that pass@1 on short tasks is structurally blind to this divergence. We introduce a reliability science framework for long-horizon LLM agents with four metrics: Reliability Decay Curve (RDC), Variance Amplification Factor (VAF), Graceful Degradation Score (GDS), and Meltdown Onset Point (MOP). We evaluate 10 models across 23,392 episodes on a 396-task benchmark spanning four duration buckets and three domains. Key findings: (1) reliability decay is domain-stratified -- SE GDS drops from 0.90 to 0.44 while document processing is nearly flat (0.74 to 0.71); (2) VAF bifurcates by capability tier -- high VAF is a capability signature, not an instability signal; (3) capability and reliability rankings diverge substantially, with multi-rank inversions at long horizons; (4) frontier models have the highest meltdown rates (up to 19%) because they attempt ambitious multi-step strategies that sometimes spiral; and (5) memory scaffolds universally hurt long-horizon performance across all 10 models. These results motivate reliability as a first-class evaluation dimension alongside capability.
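
To make the capability/reliability distinction concrete, here is a toy computation with made-up numbers (an illustration of the paper's thesis, not its benchmark or metric definitions): a model that is slightly less likely to succeed on any single attempt can still be the more reliable one once success must hold across repeated attempts.

```python
# Toy illustration of capability vs. reliability divergence (made-up numbers).
# Model A succeeds 95% of the time on every task; Model B solves 8 of 10
# tasks essentially deterministically but is flaky (30%) on the other two.
model_a = [0.95] * 10
model_b = [1.00] * 8 + [0.30] * 2

def pass_at_1(task_probs):
    """Capability: expected single-attempt success rate."""
    return sum(task_probs) / len(task_probs)

def consistency(task_probs, k=5):
    """Reliability (one simple proxy): probability that all k independent
    repeats of a task succeed, averaged over tasks."""
    return sum(p ** k for p in task_probs) / len(task_probs)

print(pass_at_1(model_a), pass_at_1(model_b))      # ~0.95 vs ~0.86 -> A looks more capable
print(consistency(model_a), consistency(model_b))  # ~0.77 vs ~0.80 -> B is more reliable
```

The same effect compounds as tasks get longer, which is why the paper argues for treating long-horizon reliability as a separate evaluation axis rather than inferring it from pass@1 on short tasks.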