Evaluating Model-Free Policy Optimization in Masked-Action Environments via an Exact Blackjack Oracle
arXiv cs.LG / March 20, 2026
Key Points
- The paper derives an exact dynamic programming oracle for infinite-shoe casino blackjack under a fixed Vegas-style ruleset, providing ground-truth action values, an optimal policy label, and an expected value of -0.00161 per hand across 4,600 decision cells.
- It assesses sample-efficient policy recovery with three model-free optimizers (masked REINFORCE with a per-cell EMA baseline, SPSA, and CEM) and finds masked REINFORCE the most sample-efficient, reaching the best action-match rate (46.37%) and EV (-0.04688) after 1,000,000 hands.
- Even the most sample-efficient method retains substantial cell-conditional regret, indicating that policy-level errors persist in sparse, masked-action environments even as aggregate rewards converge.
- The study shows that under i.i.d. draws without card counting, optimal bet sizing collapses to the table minimum: larger wagers only increase volatility without improving expected value. This underscores the need for exact oracles and negative controls to avoid mistaking stochastic variability for algorithmic performance.
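The exact oracle rests on dynamic programming over the dealer's draw process. As a minimal illustration (not the paper's code), the sketch below computes the dealer's final-total distribution under an infinite shoe, assuming the dealer stands on all 17s (S17 is an assumption, as is ignoring hole-card and blackjack conditioning); a full oracle would extend this with player action values for each of the 4,600 decision cells.

```python
from functools import lru_cache

# Infinite-shoe rank probabilities: ranks 2-9 and ace each 1/13, ten-value 4/13.
CARDS = [(c, 1 / 13) for c in range(2, 10)] + [(10, 4 / 13), (1, 1 / 13)]

@lru_cache(maxsize=None)
def dealer_dist(total, soft):
    """Distribution over dealer final totals (17..21, with 22 meaning bust),
    given the current total and whether an ace is counted as 11.
    Assumes the dealer stands on all 17s (S17) -- a rule assumption."""
    if total > 21 and soft:              # demote the soft ace: 11 -> 1
        total, soft = total - 10, False
    if total > 21:
        return ((22, 1.0),)              # collapse all busts to 22
    if total >= 17:
        return ((total, 1.0),)           # dealer stands
    dist = {}
    for c, p in CARDS:
        if c == 1 and total + 11 <= 21:  # count the ace as 11 when it fits
            branch = dealer_dist(total + 11, True)
        else:
            branch = dealer_dist(total + c, soft)
        for result, q in branch:
            dist[result] = dist.get(result, 0.0) + p * q
    return tuple(sorted(dist.items()))

# Dealer final-total distribution with a 6 showing:
print(dict(dealer_dist(6, False)))
```

With exact per-cell action values from such a recursion, the optimal policy label and the -0.00161 EV per hand can be read off directly rather than estimated from samples.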
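The masked REINFORCE variant can be sketched as a tabular, bandit-style update per decision cell: illegal actions are masked out of the softmax, and the advantage is the reward minus a per-cell exponential-moving-average baseline. The hyperparameters below (learning rate, EMA decay, four actions) are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

N_CELLS, N_ACTIONS = 4600, 4              # e.g. hit / stand / double / split
logits = np.zeros((N_CELLS, N_ACTIONS))   # tabular policy parameters
baseline = np.zeros(N_CELLS)              # per-cell EMA reward baseline
LR, EMA = 0.1, 0.99                       # hypothetical hyperparameters

def step(cell, mask, reward_fn):
    """One masked-REINFORCE update for a single decision cell.
    `mask` is a boolean array of legal actions; illegal logits get -inf."""
    z = np.where(mask, logits[cell], -np.inf)
    z = z - z.max()                        # numerically stable softmax
    p = np.exp(z)
    p /= p.sum()                           # zero probability on masked actions
    a = rng.choice(N_ACTIONS, p=p)
    r = reward_fn(a)
    adv = r - baseline[cell]               # advantage vs. per-cell EMA baseline
    grad = -p
    grad[a] += 1.0                         # d log pi(a) / d logits
    logits[cell, mask] += LR * adv * grad[mask]   # update legal actions only
    baseline[cell] = EMA * baseline[cell] + (1 - EMA) * r
    return a, r
```

As a toy usage, repeatedly calling `step(0, mask, reward_fn)` with a reward function that pays +1 for one legal action and -1 otherwise concentrates the cell's policy on the paying action while the masked action stays at zero probability.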
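The bet-sizing result follows from linearity: if a unit-bet hand outcome Z has mean mu < 0, then betting b units yields EV b*mu and variance b^2*Var(Z), so no bet above the table minimum improves per-unit EV while volatility grows quadratically. The Monte Carlo negative control below uses a hypothetical outcome distribution, not the paper's exact ruleset.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-unit hand outcomes: lose 1, push, win 1, blackjack 1.5.
# Probabilities are illustrative, chosen so the edge is slightly negative.
outcomes = np.array([-1.0, 0.0, 1.0, 1.5])
probs = np.array([0.482, 0.085, 0.410, 0.023])
mu = outcomes @ probs          # per-unit expected value (negative house edge)

for bet in (1, 5, 25):
    z = bet * rng.choice(outcomes, size=200_000, p=probs)
    # EV and standard deviation both scale linearly with the bet,
    # so variance grows as bet**2 with no per-unit improvement.
    print(f"bet={bet:>2}  EV/hand={z.mean():+.4f}  sd={z.std():.3f}")
```

Against a learner's reward curve, this kind of negative control separates genuine policy improvement from the stochastic variability that larger wagers amplify.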