Evaluating Model-Free Policy Optimization in Masked-Action Environments via an Exact Blackjack Oracle
arXiv cs.LG, March 20, 2026
Key Points
- The paper derives an exact dynamic programming oracle for infinite-shoe casino blackjack under a fixed Vegas-style ruleset, providing ground-truth action values, an optimal policy label, and an expected value of -0.00161 per hand across 4,600 decision cells.
- It assesses sample-efficient policy recovery with three model-free optimizers—masked REINFORCE with a per-cell EMA baseline, SPSA, and CEM—finding that masked REINFORCE achieves the best action-match rate (46.37%) and EV (-0.04688) after 1,000,000 hands, making it the most sample-efficient of the three.
- Despite REINFORCE's edge in sample efficiency, all three methods exhibit substantial cell-conditional regret, indicating persistent policy-level errors in sparse, masked-action environments even as aggregate rewards converge.
- The study shows that under i.i.d. draws without counting, optimal bet sizing collapses to the table minimum: larger wagers increase volatility without improving expected value. This underscores the need for exact oracles and negative controls to avoid mistaking stochastic variability for algorithmic performance.
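The masked-REINFORCE-with-EMA-baseline recipe the paper evaluates can be illustrated on a toy problem. The sketch below is not the paper's blackjack MDP or code: the state space, mask pattern, payoffs, and hyperparameters (`EMA`, `LR`) are all illustrative assumptions. It shows the two mechanics the summary names: an action mask applied inside the softmax, and a per-cell exponential-moving-average baseline used as the advantage reference.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy masked-action problem: one independent decision per "cell"
# (illustrative stand-in for the paper's 4,600 blackjack decision cells).
N_CELLS, N_ACTIONS = 6, 4
mask = rng.random((N_CELLS, N_ACTIONS)) < 0.8       # True = action is legal
mask[:, 0] = True                                    # guarantee one legal action per cell
true_ev = rng.normal(size=(N_CELLS, N_ACTIONS))      # hidden per-(cell, action) payoffs
true_ev[~mask] = -np.inf                             # illegal actions are never payable

logits = np.zeros((N_CELLS, N_ACTIONS))              # tabular softmax policy
baseline = np.zeros(N_CELLS)                         # per-cell EMA baseline
EMA, LR = 0.05, 0.5                                  # assumed rates, not the paper's

def masked_softmax(z, m):
    """Softmax over legal actions only; illegal actions get probability 0."""
    z = np.where(m, z, -np.inf)
    p = np.exp(z - z.max())
    return p / p.sum()

for step in range(20_000):
    cell = rng.integers(N_CELLS)
    probs = masked_softmax(logits[cell], mask[cell])
    a = rng.choice(N_ACTIONS, p=probs)
    r = true_ev[cell, a] + rng.normal(scale=0.5)     # noisy reward for chosen action
    adv = r - baseline[cell]                         # advantage vs. per-cell baseline
    baseline[cell] += EMA * adv                      # EMA baseline update
    grad = -probs
    grad[a] += 1.0                                   # d log pi(a|cell) / d logits
    logits[cell] += LR * adv * grad                  # REINFORCE step

greedy = np.where(mask, logits, -np.inf).argmax(axis=1)
optimal = true_ev.argmax(axis=1)
match_rate = (greedy == optimal).mean()
print(f"action-match rate vs. toy oracle: {match_rate:.2f}")
```

Because illegal actions carry zero probability, their log-probability gradient is zero and their logits stay inert, which is the property that lets a single tabular policy cover cells with different legal-action sets.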
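The bet-sizing point in the last bullet is a scaling argument: with i.i.d. draws and no counting, the per-unit outcome distribution of a hand does not depend on the wager, so a bet of size `b` multiplies the mean payoff by `b` but the standard deviation by `b` as well (variance by `b²`). With a negative per-unit EV, bigger bets only buy more volatility and larger expected losses. A minimal simulation, using an invented outcome distribution (the probabilities below are illustrative, not the paper's oracle, whose per-unit EV is -0.00161):

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed per-unit hand outcomes: win +1, lose -1, push 0, 3:2 blackjack +1.5.
outcomes = np.array([1.0, -1.0, 0.0, 1.5])
probs = np.array([0.43, 0.476, 0.07, 0.024])   # illustrative; sums to 1
ev_unit = (outcomes * probs).sum()              # slightly negative per-unit EV

n = 1_000_000
draws = rng.choice(outcomes, size=n, p=probs)   # i.i.d. hands: no card counting
for bet in (1, 5, 25):
    payoff = bet * draws
    # mean scales linearly with bet, std linearly too (variance quadratically)
    print(f"bet={bet:2d}  mean={payoff.mean():+.4f}  std={payoff.std():.3f}")
```

Since every bet size rescales the same draws, the per-hand EV stays negative at every stake; the table minimum minimizes both the expected loss and the variance, which is why the optimal sizing "collapses" there.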