Territory Paint Wars: Diagnosing and Mitigating Failure Modes in Competitive Multi-Agent PPO

arXiv cs.LG / 4/8/2026

Key Points

Territory Paint Wars is a newly released Unity-based competitive multi-agent RL benchmark used to study how PPO can fail under self-play in a symmetric zero-sum setting.
The study identifies five implementation-level failure modes (reward-scale imbalance, missing terminal signals, poor long-horizon credit assignment, unnormalized observations, and faulty win detection), each of which contributes to the initial failure: an agent trained for 84,000 episodes wins only 26.8% of games against a uniformly random opponent.
After fixing those issues, the authors identify a separate emergent problem, competitive overfitting, where self-play win rates look stable while the generalization win rate collapses from 73.5% to 21.6%.
The paper shows that standard self-play metrics may not detect competitive overfitting because both agents co-adapt, keeping self-play performance near chance.
A minimal mitigation, opponent mixing (replacing 20% of training episodes with a fixed uniformly-random opponent), restores generalization to 77.1% (±12.6%, 10 seeds) without requiring population-based training or extra infrastructure. The benchmark is open-sourced for reproducibility.
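As an illustration of the opponent-mixing idea, here is a minimal sketch of a per-episode opponent selector. The 20% mixing fraction comes from the summary above; everything else (the function names `choose_opponent` and `random_policy`, the callable-policy interface, the discrete action space) is a hypothetical assumption and not the authors' implementation.

```python
import random

# Fraction of training episodes played against the fixed random opponent
# (20% as reported in the summary; the rest of this sketch is assumed).
MIX_PROBABILITY = 0.2


def random_policy(observation, num_actions, rng):
    """Fixed uniformly-random opponent: ignores the observation entirely."""
    return rng.randrange(num_actions)


def choose_opponent(learned_opponent, num_actions, rng,
                    mix_probability=MIX_PROBABILITY):
    """Pick the opponent for one training episode.

    Returns (policy, label). With probability `mix_probability` the policy is
    the fixed random opponent; otherwise it is the co-adapting self-play
    opponent passed in as `learned_opponent` (a hypothetical callable that
    maps an observation to an action).
    """
    if rng.random() < mix_probability:
        return (lambda obs: random_policy(obs, num_actions, rng)), "random"
    return learned_opponent, "self_play"


if __name__ == "__main__":
    rng = random.Random(0)
    labels = [choose_opponent(lambda obs: 0, 4, rng)[1] for _ in range(10_000)]
    print("fraction of random-opponent episodes:",
          labels.count("random") / len(labels))
```

The selection is made once per episode rather than per step, so each episode is played against a single consistent opponent; that granularity matches the "20% of training episodes" wording but is otherwise an assumption.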

Abstract

We present Territory Paint Wars, a minimal competitive multi-agent reinforcement learning environment implemented in Unity, and use it to systematically investigate failure modes of Proximal Policy Optimisation (PPO) under self-play. A first agent trained for 84,000 episodes achieves only 26.8% win rate against a uniformly-random opponent in a symmetric zero-sum game. Through controlled ablations we identify five implementation-level failure modes -- reward-scale imbalance, missing terminal signal, ineffective long-horizon credit assignment, unnormalised observations, and incorrect win detection -- each of which contributes critically to this failure in this setting. After correcting these issues, we uncover a distinct emergent pathology: competitive overfitting, where co-adapting agents maintain stable self-play performance while generalisation win rate collapses from 73.5% to 21.6%. Critically, this failure is undetectable via standard self-play metrics: both agents co-adapt equally, so the self-play win rate remains near 50% throughout the collapse. We propose a minimal intervention -- opponent mixing, where 20% of training episodes substitute a fixed uniformly-random policy for the co-adaptive opponent -- which mitigates competitive overfitting and restores generalisation to 77.1% (±12.6%, 10 seeds) without population-based training or additional infrastructure. We open-source Territory Paint Wars to provide a reproducible benchmark for studying competitive MARL failure modes.
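Because the self-play win rate stays near 50% during the collapse, spotting competitive overfitting requires tracking a second metric against a fixed opponent. The following is a rough sketch of such a check, assuming periodic evaluation against a uniformly-random opponent; the `EvalPoint` record, the function name, and both thresholds are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class EvalPoint:
    episode: int
    self_play_win_rate: float   # win rate against the co-adapting opponent
    random_opp_win_rate: float  # win rate against a fixed random opponent


def flags_competitive_overfitting(history, self_play_band=0.10, max_drop=0.20):
    """Return True if self-play looks stable while generalisation has dropped.

    `self_play_band` (allowed distance from 50%) and `max_drop` (allowed fall
    from the best generalisation win rate seen so far) are illustrative
    thresholds, not values reported by the authors.
    """
    if not history:
        return False
    best = max(p.random_opp_win_rate for p in history)
    latest = history[-1]
    self_play_stable = all(
        abs(p.self_play_win_rate - 0.5) <= self_play_band for p in history
    )
    generalisation_dropped = (best - latest.random_opp_win_rate) >= max_drop
    return self_play_stable and generalisation_dropped


if __name__ == "__main__":
    # Endpoints 73.5% and 21.6% are from the abstract; the episode counts and
    # self-play values are made up for illustration.
    history = [
        EvalPoint(episode=1000, self_play_win_rate=0.51, random_opp_win_rate=0.735),
        EvalPoint(episode=2000, self_play_win_rate=0.49, random_opp_win_rate=0.216),
    ]
    print(flags_competitive_overfitting(history))
```

A check like this only reports what the two curves already show; its value is in forcing the fixed-opponent evaluation to be run at all, since the self-play curve alone would look healthy.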
