Match or Replay: Self Imitating Proximal Policy Optimization
arXiv cs.LG / 3/31/2026
Key Points
- The paper introduces a self-imitating on-policy reinforcement learning algorithm (Match or Replay) aimed at improving exploration and sample efficiency, especially under sparse rewards.
- It steers policy updates using past high-reward state-action pairs, prioritizing which trajectories to imitate via optimal transport in dense-reward settings.
- In sparse-reward environments, the method uniformly replays successful self-encountered trajectories to promote more structured exploration.
- Experiments on MuJoCo (dense rewards), 3D Animal-AI Olympics (partially observable sparse rewards), and multi-goal PointMaze show faster convergence and higher success rates than existing self-imitating RL baselines.
- The authors argue the approach is a robust exploration strategy for RL that could generalize to more complex tasks.
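The core replay mechanism described in the key points can be sketched roughly as follows. This is a hypothetical illustration, not the paper's implementation: `SelfImitationBuffer` and its method names are invented here, and a simple return-weighted sampler stands in for the paper's optimal-transport trajectory prioritization.

```python
import random

class SelfImitationBuffer:
    """Stores high-return trajectories the agent itself has encountered.

    Hypothetical sketch of the self-imitation replay idea: keep the best
    past trajectories and sample them differently depending on whether
    rewards are dense or sparse.
    """

    def __init__(self, capacity=100):
        self.capacity = capacity
        # Each entry is (episode_return, [(state, action), ...]).
        self.trajectories = []

    def add(self, episode_return, trajectory):
        self.trajectories.append((episode_return, trajectory))
        # Retain only the highest-return trajectories up to capacity.
        self.trajectories.sort(key=lambda t: t[0], reverse=True)
        del self.trajectories[self.capacity:]

    def sample(self, mode="sparse"):
        if not self.trajectories:
            return None
        if mode == "sparse":
            # Sparse rewards: uniformly replay any past success,
            # matching the summary's description of structured exploration.
            return random.choice(self.trajectories)[1]
        # Dense rewards: weight by return (a crude stand-in for the
        # paper's optimal-transport prioritization of trajectories).
        weights = [r for r, _ in self.trajectories]
        return random.choices(self.trajectories, weights=weights, k=1)[0][1]
```

The sampled trajectory would then supply imitation targets for the on-policy (PPO-style) update; that loss term is omitted here since the summary does not specify its exact form.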



