Abstract
We present Territory Paint Wars, a minimal competitive multi-agent reinforcement learning environment implemented in Unity, and use it to systematically investigate failure modes of Proximal Policy Optimisation (PPO) under self-play. A first agent trained for 84{,}000 episodes achieves only 26.8\% win rate against a uniformly-random opponent in a symmetric zero-sum game. Through controlled ablations we identify five implementation-level failure modes -- reward-scale imbalance, missing terminal signal, ineffective long-horizon credit assignment, unnormalised observations, and incorrect win detection -- each of which contributes critically to this failure in this setting.
After correcting these issues, we uncover a distinct emergent pathology: competitive overfitting, where co-adapting agents maintain stable self-play performance while generalisation win rate collapses from 73.5\% to 21.6\%. Critically, this failure is undetectable via standard self-play metrics: both agents co-adapt equally, so the self-play win rate remains near 50\% throughout the collapse.
We propose a minimal intervention -- opponent mixing, where 20\% of training episodes substitute a fixed uniformly-random policy for the co-adaptive opponent -- which mitigates competitive overfitting and restores generalisation to 77.1\% (\pm 12.6\%, 10 seeds) without population-based training or additional infrastructure. We open-source Territory Paint Wars to provide a reproducible benchmark for studying competitive MARL failure modes.