Beyond pass@1: A Reliability Science Framework for Long-Horizon LLM Agents

arXiv cs.AI / 4/1/2026

Key Points

  • The paper argues that conventional metrics such as pass@1 measure single-attempt capability but fail to capture the reliability that long-horizon LLM agents need, where success must be consistent across repeated attempts and tasks of varying duration.
  • It introduces a “reliability science” evaluation framework with four new metrics: Reliability Decay Curve (RDC), Variance Amplification Factor (VAF), Graceful Degradation Score (GDS), and Meltdown Onset Point (MOP), which quantify how reliability degrades as task duration grows (a rough sketch of how such metrics could be computed follows this list).
  • Across 10 models and 23,392 episodes on a 396-task benchmark spanning four duration buckets and three domains, the study finds reliability decay is domain-stratified: software-engineering GDS falls from 0.90 to 0.44, while document processing stays nearly flat (0.74 to 0.71).
  • The authors report that capability and reliability rankings can diverge substantially at long horizons (including multi-rank inversions), and that “frontier” models show the highest meltdown rates (up to 19%) because they attempt ambitious multi-step strategies that sometimes spiral.
  • A further finding is that adding memory scaffolds universally worsens long-horizon performance across all evaluated models, motivating reliability as an evaluation dimension on par with raw capability.
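
The summary does not spell out how the four metrics are computed. Below is a minimal sketch under simple assumed operationalizations (the formulas are illustrative assumptions, not the authors' definitions), deriving each metric from per-episode success records grouped by duration bucket:

```python
from collections import defaultdict
from statistics import mean, pvariance

# Hypothetical per-episode records: (model, duration_bucket, success).
# Bucket 0 holds the shortest tasks, bucket 3 the longest; the data is made up.
episodes = [
    ("model-a", 0, True), ("model-a", 0, True), ("model-a", 0, False),
    ("model-a", 3, True), ("model-a", 3, False), ("model-a", 3, False),
]

def success_by_bucket(records):
    """Group binary outcomes by duration bucket."""
    buckets = defaultdict(list)
    for _, bucket, ok in records:
        buckets[bucket].append(1.0 if ok else 0.0)
    return buckets

def reliability_decay_curve(records):
    """RDC (assumed form): mean success rate per duration bucket."""
    return {b: mean(v) for b, v in sorted(success_by_bucket(records).items())}

def variance_amplification_factor(records, short=0, long=3):
    """VAF (assumed form): outcome variance in the longest bucket
    divided by outcome variance in the shortest bucket."""
    buckets = success_by_bucket(records)
    v_short, v_long = pvariance(buckets[short]), pvariance(buckets[long])
    return v_long / v_short if v_short > 0 else float("inf")

def graceful_degradation_score(rdc):
    """GDS (assumed form): long-horizon success rate relative to
    short-horizon success rate; 1.0 means no degradation at all."""
    order = sorted(rdc)
    return rdc[order[-1]] / rdc[order[0]] if rdc[order[0]] > 0 else 0.0

def meltdown_onset_point(rdc, threshold=0.5):
    """MOP (assumed form): first duration bucket whose success rate
    drops below the threshold; None if reliability never collapses."""
    for bucket in sorted(rdc):
        if rdc[bucket] < threshold:
            return bucket
    return None

rdc = reliability_decay_curve(episodes)
print(rdc)                                      # success rate per bucket
print(variance_amplification_factor(episodes))  # 1.0 on this toy data
print(graceful_degradation_score(rdc))          # 0.5: long-horizon rate is half the short-horizon rate
print(meltdown_onset_point(rdc))                # 3: reliability first falls below 0.5 in the longest bucket
```

Under these assumed definitions, a GDS near 1.0 means reliability holds up across horizons, and a low MOP bucket flags an early collapse.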

Abstract

Existing benchmarks measure capability -- whether a model succeeds on a single attempt -- but production deployments require reliability -- consistent success across repeated attempts on tasks of varying duration. We show these properties diverge systematically as task duration grows, and that pass@1 on short tasks is structurally blind to this divergence. We introduce a reliability science framework for long-horizon LLM agents with four metrics: Reliability Decay Curve (RDC), Variance Amplification Factor (VAF), Graceful Degradation Score (GDS), and Meltdown Onset Point (MOP). We evaluate 10 models across 23,392 episodes on a 396-task benchmark spanning four duration buckets and three domains. Key findings: (1) reliability decay is domain-stratified -- SE GDS drops from 0.90 to 0.44 while document processing is nearly flat (0.74 to 0.71); (2) VAF bifurcates by capability tier -- high VAF is a capability signature, not an instability signal; (3) capability and reliability rankings diverge substantially, with multi-rank inversions at long horizons; (4) frontier models have the highest meltdown rates (up to 19%) because they attempt ambitious multi-step strategies that sometimes spiral; and (5) memory scaffolds universally hurt long-horizon performance across all 10 models. These results motivate reliability as a first-class evaluation dimension alongside capability.
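
To make the capability/reliability distinction concrete, here is a toy computation with made-up numbers (an illustration of the paper's thesis, not its benchmark or metric definitions): a model that is slightly less likely to succeed on any single attempt can still be the more reliable one once success must hold across repeated attempts.

```python
# Toy illustration of capability vs. reliability divergence (made-up numbers).
# Model A succeeds 95% of the time on every task; Model B solves 8 of 10
# tasks essentially deterministically but is flaky (30%) on the other two.
model_a = [0.95] * 10
model_b = [1.00] * 8 + [0.30] * 2

def pass_at_1(task_probs):
    """Capability: expected single-attempt success rate."""
    return sum(task_probs) / len(task_probs)

def consistency(task_probs, k=5):
    """Reliability (one simple proxy): probability that all k independent
    repeats of a task succeed, averaged over tasks."""
    return sum(p ** k for p in task_probs) / len(task_probs)

print(pass_at_1(model_a), pass_at_1(model_b))      # ~0.95 vs ~0.86 -> A looks more capable
print(consistency(model_a), consistency(model_b))  # ~0.77 vs ~0.80 -> B is more reliable
```

The same effect compounds as tasks get longer, which is why the paper argues for treating long-horizon reliability as a separate evaluation axis rather than inferring it from pass@1 on short tasks.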