Beyond Completion: Probing Cumulative State Tracking to Predict LLM Agent Performance
arXiv cs.AI / 3/31/2026
Opinion | Signals & Early Trends | Ideas & Deep Analysis | Models & Research
Key Points
- The paper argues that task-completion rate alone can miss important differences in how well LLM agents track intermediate cumulative state.
- It introduces WMF-AM, a “no-scratchpad” calibrated probe for cumulative arithmetic state tracking, and evaluates it across 20 open-weight model families.
- In a pre-specified analysis corrected for multiple comparisons, WMF-AM scores significantly predict performance on a deterministic 10-task agent evaluation (Kendall’s tau = 0.612, p < 0.001).
- Construct-isolation ablations indicate that the main challenge for agents under load is cumulative state tracking, not just single-step arithmetic or entity tracking.
- The authors note that K-calibration helps keep the probe discriminative compared with earlier fixed-depth benchmarks, while generalization beyond the studied open-weight models remains an open question.
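
For readers unfamiliar with the statistic quoted above, here is a minimal sketch of how a Kendall's tau rank correlation between probe scores and agent scores could be computed. It is not the authors' analysis code. The function name `kendall_tau_b` and the five score pairs are hypothetical illustrations, not data from the paper, and the sketch computes the tie-corrected tau-b variant only (no p-value or multiple-comparison correction).

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b rank correlation between two equal-length score lists."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = ties_x = ties_y = 0
    # Compare every pair of models and check whether both scores agree on the ordering.
    for i, j in combinations(range(len(x)), 2):
        dx = x[i] - x[j]
        dy = y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both lists: excluded from both denominator terms
        elif dx == 0:
            ties_x += 1  # tied only in x
        elif dy == 0:
            ties_y += 1  # tied only in y
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + ties_x) * (untied + ties_y))
    if denom == 0:
        raise ValueError("tau is undefined when one input is constant")
    return (concordant - discordant) / denom


# Hypothetical scores for five models (illustrative values only, NOT from the paper).
probe_scores = [0.91, 0.40, 0.75, 0.22, 0.63]  # e.g. probe accuracy per model
agent_scores = [9, 5, 6, 2, 7]                 # e.g. tasks solved out of 10

tau = kendall_tau_b(probe_scores, agent_scores)
print(f"{tau:.3f}")  # 9 concordant pairs, 1 discordant pair -> 0.800
```

A tau near 1 means the probe ranks models in almost the same order as the agent evaluation does; a tau near 0 means the two rankings are unrelated.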



