Zero Shot Coordination for Sparse Reward Tasks with Diverse Reward Shapings

arXiv cs.LG · April 29, 2026


Key Points

  • The paper tackles Zero-Shot Coordination (ZSC) in multi-agent reinforcement learning, where agents must cooperate with previously unseen partners trained with similar objectives but different seeds, algorithms, or training setups.
  • Prior ZSC approaches typically assume identical reward functions across trained agents and future partners, which the authors argue is unrealistic for sparse-reward tasks.
  • To make ZSC robust to differences in reward shaping, the authors propose training an ensemble of agents under randomized reward shapings, with the shapings chosen by one of four selection algorithms (a minimal sketch follows this list).
  • Experiments in the Overcooked environment show substantial gains of 62.2% to 119.2% in sparse reward over baseline ZSC methods when partners share sparse objectives but differ in how rewards are shaped.
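
To make the ensemble idea concrete, here is a minimal Python sketch of training one policy per randomly sampled reward shaping. The shaping terms, weight ranges, uniform sampling, and `train_fn` are illustrative assumptions; the paper's actual shaping components and its four selection algorithms are not reproduced here.

```python
import random

# Hypothetical dense shaping events for an Overcooked-style task; the
# paper's actual shaping components are not specified in this summary.
SHAPING_TERMS = ["onion_pickup", "dish_pickup", "soup_delivery_progress"]

def sample_shaping(rng):
    """Draw one randomized reward shaping: a weight per dense term."""
    return {term: rng.uniform(0.0, 1.0) for term in SHAPING_TERMS}

def shaped_reward(sparse_reward, event_counts, weights):
    """Sparse task reward plus a weighted sum of dense shaping events."""
    dense = sum(weights[t] * event_counts.get(t, 0) for t in SHAPING_TERMS)
    return sparse_reward + dense

def train_ensemble(n_members, train_fn, seed=0):
    """Train one policy per sampled shaping. Uniform sampling stands in
    for the paper's four shaping-selection algorithms."""
    rng = random.Random(seed)
    ensemble = []
    for _ in range(n_members):
        weights = sample_shaping(rng)
        ensemble.append((weights, train_fn(weights)))  # e.g., self-play PPO
    return ensemble
```

Each ensemble member thus sees the same sparse objective but a different dense shaping, which is what should expose the trained agent to the diversity of behaviors it will meet in unseen partners.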

Abstract

Many Multi-Agent Reinforcement Learning (MARL) agents fail to cooperate effectively with agents trained on the same objectives but with different seeds, algorithms, or other training differences. This is the problem of Zero-Shot Coordination (ZSC), which focuses on training agents to cooperate well with unknown partners. ZSC has been studied in a variety of tabular cases and simple games such as Hanabi, achieving excellent results. However, existing solutions to ZSC only consider identical rewards for the trained agents and all future partners. This assumption is unrealistic, as it ignores the problem of cooperating with agents that share identical sparse objectives but shape the rewards for those objectives in different manners. To address this issue, we show how to train an ensemble of agents using randomized reward shapings chosen by four selection algorithms. Experiments in the Overcooked environment demonstrate consistent improvements of 62.2% to 119.2% in sparse reward over baseline ZSC algorithms when playing with agents that have identical sparse rewards but different reward shapings.
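
The evaluation the abstract describes, scoring a trained agent against unseen partners on the sparse reward only, can be sketched as below. `run_episode` and the partner set are hypothetical stand-ins, not the paper's actual evaluation harness.

```python
def cross_play_sparse_return(agent, partners, run_episode, n_episodes=10):
    """Zero-shot coordination score: mean sparse return of `agent` when
    paired with each unseen partner. `run_episode(agent, partner)` is a
    hypothetical rollout that returns only the sparse task reward,
    ignoring whatever shaping either agent was trained with."""
    per_partner = []
    for partner in partners:
        returns = [run_episode(agent, partner) for _ in range(n_episodes)]
        per_partner.append(sum(returns) / n_episodes)
    return sum(per_partner) / len(per_partner)
```

Scoring only the sparse reward is the key design choice here: since partners differ precisely in their shaping terms, the shared sparse objective is the one metric on which all pairings are comparable.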