LUDOBENCH: Evaluating LLM Behavioural Decision-Making Through Spot-Based Board Game Scenarios in Ludo

arXiv cs.AI / 4/8/2026


Key Points

  • LudoBench is introduced as a new benchmark for evaluating LLM strategic decision-making in Ludo, a stochastic multi-agent board game with dice-based uncertainty and planning-relevant mechanics.
  • The benchmark includes 480 handcrafted spot scenarios across 12 decision categories, and it isolates specific strategic choices to make model behavior easier to interpret and diagnose.
  • The accompanying 4-player Ludo simulator supports Random, Heuristic, Game-Theory (depth-limited Expectiminimax), and LLM agents, enabling comparisons against a principled strategic baseline.
  • Experiments across six models show low alignment with the game-theory agent (only 40–46%), with models clustering into two incomplete strategy archetypes: “finishers” and “builders.”
  • Models also exhibit prompt/history sensitivity, including measurable behavioral shifts under grudge-style framing on identical board states, exposing a robustness gap in reasoning under uncertainty.
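The game-theory baseline above is depth-limited Expectiminimax, which interleaves max/min layers for player choices with chance layers that average over dice outcomes. The following is a minimal, self-contained sketch of the idea on a toy two-player race game (the `State` class, `GOAL`, and the two-pieces-per-player setup are illustrative assumptions, not the LudoBench simulator's actual API):

```python
from dataclasses import dataclass

GOAL = 20  # toy target square (assumption; real Ludo tracks are longer)

@dataclass(frozen=True)
class State:
    pieces: tuple  # pieces[p] = positions of player p's two pieces
    current: int   # index of the player about to roll

    def is_terminal(self):
        return any(all(q >= GOAL for q in p) for p in self.pieces)

    def evaluate(self, player):
        # Crude progress heuristic: my advancement minus the opponent's.
        opp = 1 - player
        return sum(self.pieces[player]) - sum(self.pieces[opp])

    def apply(self, piece, roll):
        # Advance one of the current player's pieces and pass the turn.
        mine = list(self.pieces[self.current])
        mine[piece] = min(GOAL, mine[piece] + roll)
        new = list(self.pieces)
        new[self.current] = tuple(mine)
        return State(tuple(new), 1 - self.current)

def expectiminimax(state, depth, player):
    """Depth-limited expectiminimax value of `state` for `player`."""
    if depth == 0 or state.is_terminal():
        return state.evaluate(player)
    total = 0.0
    for roll in range(1, 7):  # chance node: six equiprobable dice faces
        children = [expectiminimax(state.apply(piece, roll), depth - 1, player)
                    for piece in (0, 1)]
        # Max layer on the evaluating player's turn, min on the opponent's.
        total += max(children) if state.current == player else min(children)
    return total / 6.0

start = State(pieces=((0, 0), (0, 0)), current=0)
print(expectiminimax(start, 1, 0))  # one roll ahead: expected advance 3.5
```

Depth-limiting the recursion trades optimality for tractability, which is why the paper treats this agent as a principled strategic ceiling rather than a perfect player.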

Abstract

We introduce LudoBench, a benchmark for evaluating LLM strategic reasoning in Ludo, a stochastic multi-agent board game whose dice mechanics, piece capture, safe-square navigation, and home-path progression introduce meaningful planning complexity. LudoBench comprises 480 handcrafted spot scenarios across 12 behaviorally distinct decision categories, each isolating a specific strategic choice. We additionally contribute a fully functional 4-player Ludo simulator supporting Random, Heuristic, Game-Theory, and LLM agents. The game-theory agent uses Expectiminimax search with depth-limited lookahead to provide a principled strategic ceiling beyond greedy heuristics. Evaluating six models spanning four model families, we find that all models agree with the game-theory baseline only 40–46% of the time. Models split into distinct behavioral archetypes: finishers that complete pieces but neglect development, and builders that develop but never finish. Each archetype captures only half of the game-theory strategy. Models also display measurable behavioral shifts under history-conditioned grudge framing on identical board states, revealing prompt sensitivity as a key vulnerability. LudoBench provides a lightweight and interpretable framework for benchmarking LLM strategic reasoning under uncertainty. All code, the spot dataset (480 entries), and model outputs are available at https://anonymous.4open.science/r/LudoBench-5CBF/