Reproducing AlphaZero on Tablut: Self-Play RL for an Asymmetric Board Game
arXiv cs.LG / 4/8/2026
Key Points
- The paper explores adapting AlphaZero-style self-play reinforcement learning to Tablut, a highly asymmetric board game with different objectives and unequal piece counts for attacker vs. defender.
- It argues that the standard AlphaZero setup, with a single shared policy head and value head, struggles in asymmetric settings because it forces one network to learn two conflicting evaluation functions, one per role.
- To improve learning, the authors modify the architecture to use separate policy and value heads per player role while sharing a residual trunk to capture common board representations.
- Training initially proved unstable due to catastrophic forgetting between roles; the authors mitigate this with C4 (board-rotation) data augmentation, a larger replay buffer, and a checkpoint-mixing strategy that plays 25% of games against past checkpoints.
- After 100 self-play iterations, the model shows steady improvement, reaching a BayesElo of 1235 versus a random-initialization baseline, with training metrics indicating more decisive play as policy entropy drops.
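The per-role head modification described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the trunk is reduced to a single dense layer, and the dimensions (a flattened 9x9 board, 32 trunk features, from-square/to-square move encoding) are assumptions for the sake of the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions: flattened 9x9 Tablut board, 32 trunk features,
# moves encoded as (from-square, to-square) pairs.
TRUNK_DIM, N_MOVES = 32, 81 * 81

# One shared trunk; one (policy, value) head pair per player role.
W_trunk = rng.normal(size=(81, TRUNK_DIM)) * 0.1
heads = {
    role: {
        "policy": rng.normal(size=(TRUNK_DIM, N_MOVES)) * 0.1,
        "value": rng.normal(size=(TRUNK_DIM,)) * 0.1,
    }
    for role in ("attacker", "defender")
}

def forward(board_flat, role):
    """Run the shared trunk, then the head pair for the given role."""
    h = np.tanh(board_flat @ W_trunk)            # shared board representation
    logits = h @ heads[role]["policy"]           # role-specific move logits
    value = float(np.tanh(h @ heads[role]["value"]))  # role-specific eval in [-1, 1]
    return logits, value
```

The same position thus evaluates differently depending on whose turn is being scored, while both roles still train the shared trunk.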
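The C4 augmentation amounts to generating the four 90-degree rotations of each training position. A simplified sketch, assuming the board and the policy target are both single 9x9 planes (a full implementation would rotate every input plane and remap the move-index encoding accordingly):

```python
import numpy as np

def c4_augment(board, policy_plane):
    """Yield the four 90-degree rotations (the cyclic group C4) of a
    position together with its correspondingly rotated policy target."""
    for k in range(4):
        yield np.rot90(board, k), np.rot90(policy_plane, k)
```

Each self-play position then contributes four training samples instead of one, which helps counter the limited data per role.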
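The checkpoint-mixing strategy can be summarized as an opponent-sampling rule: before each self-play game, play a past checkpoint with probability 0.25, otherwise the current network. A minimal sketch (function and argument names are illustrative, not from the paper):

```python
import random

def pick_opponent(current_net, past_checkpoints, past_frac=0.25, rng=random):
    """With probability past_frac, return a uniformly sampled past
    checkpoint as the opponent; otherwise self-play the current network."""
    if past_checkpoints and rng.random() < past_frac:
        return rng.choice(past_checkpoints)
    return current_net
```

Mixing in older opponents keeps the network exposed to strategies it would otherwise forget as both roles co-evolve.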