AI Navigate

Evaluating Model-Free Policy Optimization in Masked-Action Environments via an Exact Blackjack Oracle

arXiv cs.LG / March 20, 2026


Key Points

  • The paper derives an exact dynamic programming oracle for infinite-shoe casino blackjack under a fixed Vegas-style ruleset, providing ground-truth action values, optimal policy labels, and an expected value of -0.00161 per hand across 4,600 decision cells.
  • It assesses sample-efficient policy recovery with three model-free optimizers (masked REINFORCE with a per-cell EMA baseline, SPSA, and CEM), finding that REINFORCE achieves the best action-match rate (46.37%) and EV (-0.04688) after 1,000,000 hands, making it the most sample-efficient of the three.
  • Despite that edge in sample efficiency, all methods exhibit substantial cell-conditional regret: policy-level errors persist in sparse, masked-action environments even as aggregate rewards converge.
  • The study shows that under i.i.d. draws without counting, optimal bet sizing collapses to the table minimum and larger wagers increase volatility without improving expected value, underscoring the need for exact oracles and negative controls to avoid mistaking stochastic variability for algorithmic performance.
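The oracle described above is a textbook dynamic program, and its core ingredient is the dealer's final-total distribution under an infinite shoe, which a short recursion captures. Below is a minimal Python sketch, not the paper's code: it handles S17 only and ignores dealer-peek conditioning, splits, and doubles.

```python
from functools import lru_cache

# Infinite-shoe draw probabilities: ace (=1) through 9 each 1/13,
# ten-valued cards 4/13.
CARD_P = {r: (4 / 13 if r == 10 else 1 / 13) for r in range(1, 11)}

@lru_cache(maxsize=None)
def dealer_dist(total, soft):
    """Distribution over dealer final totals (17-21, or 22 for bust),
    given the current hand total and whether an ace is counted as 11,
    under S17 (dealer stands on all 17s)."""
    if total > 21:
        if soft:
            return dealer_dist(total - 10, False)  # demote the soft ace
        return {22: 1.0}                           # bust
    if total >= 17:
        return {total: 1.0}                        # S17: stand on 17+
    out = {}
    for card, p in CARD_P.items():
        nt, ns = total + card, soft
        if card == 1 and nt + 10 <= 21:
            nt, ns = nt + 10, True                 # count new ace as 11
        for t, q in dealer_dist(nt, ns).items():
            out[t] = out.get(t, 0.0) + p * q
    return out

def stand_ev(player_total, up):
    """EV of standing on player_total vs. dealer upcard (1 = ace).
    No peek conditioning: the dealer's natural is folded into the
    distribution, unlike the paper's oracle."""
    soft = up == 1
    start = up + 10 if soft else up
    ev = 0.0
    for t, p in dealer_dist(start, soft).items():
        if t == 22 or player_total > t:
            ev += p
        elif player_total < t:
            ev -= p
    return ev
```

With this distribution, the EV of standing falls out directly; hit, double, and split values then follow by the usual backward induction over player states.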

Abstract

Infinite-shoe casino blackjack provides a rigorous, exactly verifiable benchmark for discrete stochastic control under dynamically masked actions. Under a fixed Vegas-style ruleset (S17, 3:2 payout, dealer peek, double on any two, double after split, resplit to four), an exact dynamic programming (DP) oracle was derived over 4,600 canonical decision cells. This oracle yielded ground-truth action values, optimal policy labels, and a theoretical expected value (EV) of -0.00161 per hand. To evaluate sample-efficient policy recovery, three model-free optimizers were trained via simulated interaction: masked REINFORCE with a per-cell exponential moving average baseline, simultaneous perturbation stochastic approximation (SPSA), and the cross-entropy method (CEM). REINFORCE was the most sample-efficient, achieving a 46.37% action-match rate and an EV of -0.04688 after 10^6 hands, outperforming CEM (39.46%, 7.5x10^6 evaluations) and SPSA (38.63%, 4.8x10^6 evaluations). However, all methods exhibited substantial cell-conditional regret, indicating persistent policy-level errors despite smooth reward convergence. This gap shows that tabular environments with severe state-visitation sparsity and dynamic action masking remain challenging, while aggregate reward curves can obscure critical local failures. As a negative control, it was proven and empirically confirmed that under i.i.d. draws without counting, optimal bet sizing collapses to the table minimum. In addition, larger wagers strictly increased volatility and ruin without improving expectation. These results highlight the need for exact oracles and negative controls to avoid mistaking stochastic variability for genuine algorithmic performance.
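Of the three optimizers, masked REINFORCE with a per-cell EMA baseline is the easiest to make concrete. The sketch below assumes a tabular softmax policy over per-cell logits; the cell and action counts, learning rate, and EMA decay are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

N_CELLS, N_ACTIONS = 4600, 4   # e.g. stand/hit/double/split (illustrative)
logits = np.zeros((N_CELLS, N_ACTIONS))
baseline = np.zeros(N_CELLS)   # per-cell EMA baseline of observed reward
LR, BETA = 0.1, 0.99           # assumed hyperparameters

def masked_softmax(z, mask):
    """Softmax over legal actions; illegal actions get exactly zero mass."""
    z = np.where(mask, z, -np.inf)
    e = np.exp(z - z.max())
    return e / e.sum()

def act(cell, mask):
    p = masked_softmax(logits[cell], mask)
    return rng.choice(N_ACTIONS, p=p), p

def update(cell, action, mask, reward):
    """One REINFORCE step: score-function gradient times the advantage
    against the per-cell EMA baseline."""
    adv = reward - baseline[cell]
    p = masked_softmax(logits[cell], mask)
    grad = -p
    grad[action] += 1.0                    # grad of log pi(action | cell)
    logits[cell][mask] += LR * adv * grad[mask]
    baseline[cell] = BETA * baseline[cell] + (1 - BETA) * reward
```

Masking at the softmax, rather than after sampling, keeps illegal actions at exactly zero probability, so the score-function gradient never pushes weight onto them.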
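The negative control also has a simple empirical shape: with i.i.d. hands, a flat bet of b units scales the per-hand mean by b but the standard deviation by b as well, so a larger wager buys variance, not expectation. A toy check using a win/lose/push outcome model whose mean matches the oracle EV (the push probability and calibration here are illustrative, not the paper's simulator):

```python
import numpy as np

rng = np.random.default_rng(1)
MU = -0.00161  # oracle EV per unit bet, from the paper's abstract

def per_hand_stats(bet, hands=200_000):
    """Empirical mean and std of per-hand P&L for a flat bet, under a
    toy win/lose/push model calibrated to mean MU per unit."""
    p_win, p_loss, p_push = (0.9 + MU) / 2, (0.9 - MU) / 2, 0.1
    pnl = bet * rng.choice([1.0, -1.0, 0.0], size=hands,
                           p=[p_win, p_loss, p_push])
    return pnl.mean(), pnl.std()

m1, s1 = per_hand_stats(1)
m5, s5 = per_hand_stats(5)
# Volatility scales linearly with the bet; the (negative) mean does too,
# which is why optimal bet sizing collapses to the table minimum.
```

Since the per-unit expectation is non-positive and fixed, any bet above the minimum only increases variance and the chance of ruin, matching the paper's proof and empirical confirmation.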