KellyBench: A Benchmark for Long-Horizon Sequential Decision Making

arXiv cs.AI / 5/1/2026

Key Points

  • The paper introduces KellyBench, a new benchmark aimed at evaluating long-horizon sequential decision making in non-stationary, open-ended settings rather than narrow procedural tasks where benchmarks are already saturated.
  • KellyBench simulates the 2023–24 English Premier League season in a sequential sports-betting environment, challenging agents to maximize long-term bankroll growth using detailed historical data (advanced stats, lineups, and public odds).
  • Agents are expected to build machine learning models, detect edges in public markets, and continuously adapt their strategies as conditions change over time (a minimal staking sketch follows this list).
  • Results show that all evaluated frontier models lose money on average across five seeds, with the best model still averaging a -8% return and many runs experiencing ruin.
  • Using a human-expert rubric, the study finds model strategies are generally less sophisticated than human baselines; Claude Opus 4.6 scores 26.5%, indicating substantial room for improvement. The benchmark is released as an open-access API endpoint.
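
The benchmark's name presumably references the Kelly criterion, the classic rule for sizing repeated bets to maximize long-run bankroll growth. As a minimal sketch (not from the paper), full-Kelly staking for a binary bet looks like this, where `p` is the agent's estimated win probability and the odds are in decimal form:

```python
def kelly_fraction(p: float, decimal_odds: float) -> float:
    """Full-Kelly stake as a fraction of bankroll for a binary bet.

    p            -- the bettor's estimated win probability
    decimal_odds -- public decimal odds (total payout per unit staked)
    """
    b = decimal_odds - 1.0     # net profit per unit staked on a win
    f = p - (1.0 - p) / b      # Kelly: f* = (b*p - (1 - p)) / b
    return max(f, 0.0)         # stake nothing when there is no edge

# Example: a model gives a 50% win chance at decimal odds of 2.20
# (implied probability ~45.5%), so Kelly stakes ~8.3% of the bankroll.
print(f"{kelly_fraction(0.50, 2.20):.1%}")  # 8.3%
```

Staking beyond the Kelly fraction is a classic route to ruin, which is one plausible explanation for the many bankrupt runs reported above.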

Abstract

Language models are saturating benchmarks for procedural tasks with narrow objectives, but they are increasingly being deployed in long-horizon, non-stationary environments with open-ended goals. In this paper we introduce KellyBench, an environment for evaluating sequential decision-making in sports betting markets. Agents are placed in a sequential simulation of the 2023-24 English Premier League season and tasked with maximising their long-term bankroll growth. They are given detailed historical data, including advanced statistics, lineups, and public odds. To succeed they must build machine learning models, identify edge in public markets, and adapt as the environment changes over time. We find that all frontier models evaluated lose money on average over the course of the season across five seeds. The best-performing model achieves an average return of -8%, and many models experience ruin across seeds. To judge strategy sophistication, we use a human expert rubric to grade each model and find their approaches unsophisticated compared to human baselines; Claude Opus 4.6 achieves a rubric score of 26.5%, leaving significant room for improvement. KellyBench is available as an open-access API endpoint at https://openreward.ai/GeneralReasoning/KellyBench.
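
To make "edge" concrete (an illustration, not from the paper): a bet has positive expected value when the agent's estimated probability exceeds the probability implied by the public odds. A minimal sketch:

```python
def implied_probability(decimal_odds: float) -> float:
    """Win probability implied by decimal odds, ignoring the bookmaker's margin."""
    return 1.0 / decimal_odds

def edge(model_p: float, decimal_odds: float) -> float:
    """Expected profit per unit staked; positive means the market underprices the outcome."""
    return model_p * decimal_odds - 1.0

# Example: a model gives the home side a 55% win chance at decimal odds
# of 2.00 (implied 50%), for an expected return of +10% per unit staked.
print(f"{edge(0.55, 2.00):+.2f}")  # +0.10
```

In practice an agent must also overcome the bookmaker's overround, so its probability estimates need to beat the market by more than that margin before any bet is worth placing.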