KellyBench: A Benchmark for Long-Horizon Sequential Decision Making
arXiv cs.AI / 5/1/2026
Key Points
- The paper introduces KellyBench, a new benchmark aimed at evaluating long-horizon sequential decision making in non-stationary, open-ended settings rather than narrow procedural tasks where benchmarks are already saturated.
- KellyBench simulates the 2023–24 English Premier League season in a sequential sports-betting environment, challenging agents to maximize long-term bankroll growth using detailed historical data (advanced stats, lineups, and public odds).
- Agents are expected to combine machine-learning model building with edge detection in public markets, and to continuously adapt their strategies as conditions change over time.
- Results show that all evaluated frontier models lose money on average across five seeds, with the best model still averaging a -8% return and many runs ending in ruin (bankroll exhausted).
- Using a human-expert rubric, the study finds model strategies are generally less sophisticated than human baselines; Claude Opus 4.6 scores 26.5%, indicating substantial room for improvement, and the benchmark is released via an open-access API endpoint.
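The benchmark's name and its bankroll-growth objective allude to the Kelly criterion. As an illustrative sketch only (not code from the paper), the functions below show how an agent might size a stake once its model's win probability exceeds the probability implied by the public odds; all names are hypothetical.

```python
def implied_probability(decimal_odds: float) -> float:
    """Break-even win probability implied by quoted decimal odds."""
    return 1.0 / decimal_odds

def kelly_fraction(p_win: float, decimal_odds: float) -> float:
    """Kelly-optimal fraction of bankroll for a binary bet.

    p_win: model-estimated probability of winning.
    decimal_odds: bookmaker decimal odds (total payout per unit staked).
    Returns the fraction of bankroll to stake; 0 when there is no edge.
    """
    b = decimal_odds - 1.0           # net odds: profit per unit staked
    q = 1.0 - p_win
    f = (b * p_win - q) / b          # classic Kelly formula f* = (bp - q) / b
    return max(f, 0.0)               # never stake on a negative expected edge
```

For example, at even-money odds of 2.0 the market implies a 50% win probability; a model estimating 55% has a 5-point edge and Kelly recommends staking 10% of the bankroll, while a 45% estimate yields no bet. In practice agents often bet a fraction of Kelly to reduce the variance (and ruin risk) the benchmark results highlight.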