SkyNet: Belief-Aware Planning for Partially-Observable Stochastic Games

arXiv cs.AI / March 31, 2026


Key Points

  • The paper argues that MuZero-style model-based RL, while effective in perfect-information games, struggles in partially observable stochastic multi-player settings because its latent encoding lacks an explicit way to represent uncertainty over hidden state.
  • It introduces SkyNet (Belief-Aware MuZero), which keeps the standard MuZero architecture but adds ego-conditioned auxiliary heads for winner prediction and rank estimation to make latent states more outcome-predictive under partial observability.
  • The authors evaluate SkyNet on Skyjo, a partially observable, non-zero-sum, stochastic card game, using a decision-granularity environment, transformer-based encoding, and a curriculum that mixes self-play with heuristic opponents.
  • In matched 1000-game head-to-head evaluations, SkyNet reaches a 75.3% peak win rate versus the baseline, corresponding to a +194 Elo improvement, and also shows large gains over heuristic opponents.
  • The study finds SkyNet initially underperforms due to training dynamics but surpasses the baseline once training throughput is sufficient, indicating that belief-aware auxiliary supervision improves representations given adequate data flow.
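The core idea above — attaching ego-conditioned winner-prediction and rank-estimation heads to MuZero's latent state — can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: the head shapes, the one-hot ego conditioning, the player count, and the 0.5 loss weighting are all assumptions for exposition.

```python
import math
import random

random.seed(0)

LATENT_DIM = 8
NUM_PLAYERS = 4  # assumed player count; the paper's Skyjo setup may differ

def linear(x, w, b):
    """Affine map: one output per row of the weight matrix."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(w, b)]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def make_head(out_dim, in_dim):
    """Randomly initialized linear head (weights, bias)."""
    return ([[random.gauss(0, 0.1) for _ in range(in_dim)]
             for _ in range(out_dim)],
            [0.0] * out_dim)

# Hypothetical auxiliary heads; names and shapes are illustrative only.
winner_w, winner_b = make_head(NUM_PLAYERS, LATENT_DIM + NUM_PLAYERS)
rank_w, rank_b = make_head(1, LATENT_DIM + NUM_PLAYERS)

def auxiliary_losses(latent, ego_id, winner_id, ego_rank):
    """Ego-conditioned targets: which player wins, and the ego player's final rank."""
    ego_onehot = [1.0 if i == ego_id else 0.0 for i in range(NUM_PLAYERS)]
    h = latent + ego_onehot                 # condition both heads on the ego player
    p_win = softmax(linear(h, winner_w, winner_b))
    winner_loss = -math.log(p_win[winner_id] + 1e-12)   # cross-entropy
    rank_pred = linear(h, rank_w, rank_b)[0]
    rank_loss = (rank_pred - ego_rank) ** 2             # squared error
    return winner_loss, rank_loss

latent = [random.gauss(0, 1) for _ in range(LATENT_DIM)]
wl, rl = auxiliary_losses(latent, ego_id=0, winner_id=2, ego_rank=1.0)
total_aux = wl + 0.5 * rl   # 0.5 weighting is illustrative, not from the paper
```

In training, `total_aux` would simply be added to MuZero's usual policy/value/reward losses, which is why the search algorithm itself needs no changes: the auxiliary gradients only shape the latent state.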

Abstract

In 2019, Google DeepMind released MuZero, a model-based reinforcement learning method that achieves strong results in perfect-information games by combining learned dynamics models with Monte Carlo Tree Search (MCTS). However, comparatively little work has extended MuZero to partially observable, stochastic, multi-player environments, where agents must act under uncertainty about hidden state. Such settings arise not only in card games but in domains such as autonomous negotiation, financial trading, and multi-agent robotics. In the absence of explicit belief modeling, MuZero's latent encoding has no dedicated mechanism for representing uncertainty over unobserved variables. To address this, we introduce SkyNet (Belief-Aware MuZero), which adds ego-conditioned auxiliary heads for winner prediction and rank estimation to the standard MuZero architecture. These objectives encourage the latent state to retain information predictive of outcomes under partial observability, without requiring explicit belief-state tracking or changes to the search algorithm. We evaluate SkyNet on Skyjo, a partially observable, non-zero-sum, stochastic card game, using a decision-granularity environment, transformer-based encoding, and a curriculum of heuristic opponents with self-play. In 1000-game head-to-head evaluations at matched checkpoints, SkyNet achieves a 75.3% peak win rate against the baseline (+194 Elo, p < 10^-50). SkyNet also outperforms the baseline against heuristic opponents (0.720 vs. 0.466 win rate). Critically, the belief-aware model initially underperforms the baseline but decisively surpasses it once training throughput is sufficient, suggesting that belief-aware auxiliary supervision improves learned representations under partial observability, but only given adequate data flow.
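The reported +194 Elo figure follows directly from the 75.3% head-to-head win rate under the standard logistic Elo model. A quick sanity check (this treats every game as a decisive result; how the paper handles draws, if any, is not stated here):

```python
import math

def elo_gap(win_rate):
    """Elo difference implied by a head-to-head win rate (logistic Elo model)."""
    return 400.0 * math.log10(win_rate / (1.0 - win_rate))

gap = elo_gap(0.753)  # SkyNet's peak win rate vs. the baseline
print(round(gap))     # -> 194, matching the reported +194 Elo
```

Inverting the same formula, a +194 Elo gap predicts a win probability of 1 / (1 + 10^(-194/400)) ≈ 0.753, so the two reported numbers are mutually consistent.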