π-Play: Multi-Agent Self-Play via Privileged Self-Distillation without External Data

arXiv cs.LG · April 16, 2026


Key Points

  • The paper introduces π-Play, a multi-agent self-play framework that converts sparse-reward training into a dense-feedback training loop by using an intermediate artifact called a Question Construction Path (QCP).
  • π-Play uses an examiner to generate both tasks and QCPs, while a teacher model consumes the QCP as privileged context to provide dense supervision to a student through self-distillation—without any external or labeled data.
  • The key insight is that self-play naturally yields QCPs, which act as high-quality privileged information derived at low cost and scalable across tasks.
  • Experiments report that data-free π-Play surpasses fully supervised search agents and improves evolutionary efficiency by about 2–3× compared with conventional self-play.
  • The work targets core training issues for deep search agents, including sparse rewards, weak credit assignment, and limited availability of labeled data, by restructuring the supervision signal.
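The contrast between sparse outcome rewards and QCP-derived dense feedback can be illustrated with a toy reward function. This is a minimal sketch for intuition only: the function names, the exact-match scoring rule, and the example trajectory are illustrative assumptions, not the paper's implementation.

```python
def sparse_reward(trajectory: list[str], correct_answer: str) -> list[float]:
    """Conventional self-play: only the final step carries any signal,
    so credit assignment over intermediate actions is weak."""
    hit = 1.0 if trajectory and trajectory[-1] == correct_answer else 0.0
    return [0.0] * (len(trajectory) - 1) + [hit]

def dense_reward(trajectory: list[str], qcp: list[str]) -> list[float]:
    """QCP-based feedback: every step is scored against the question
    construction path, so each action receives its own signal.
    (Toy scoring rule: exact membership in the QCP.)"""
    return [1.0 if step in qcp else 0.0 for step in trajectory]

# Hypothetical search trajectory and QCP for a capital-city question.
traj = ["lookup source", "cross-check date", "Paris"]
qcp = ["lookup source", "cross-check date", "hide answer Paris"]
print(sparse_reward(traj, "Paris"))  # one terminal reward
print(dense_reward(traj, qcp))       # per-step feedback
```

With a sparse reward, a trajectory that fails at the last step yields zero signal everywhere; the dense variant still credits the correct intermediate steps.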

Abstract

Deep search agents have emerged as a promising paradigm for addressing complex information-seeking tasks, but their training remains challenging due to sparse rewards, weak credit assignment, and limited labeled data. Self-play offers a scalable route to reduce data dependence, but conventional self-play optimizes students only through sparse outcome rewards, leading to low learning efficiency. In this work, we observe that self-play naturally produces a question construction path (QCP) during task generation, an intermediate artifact that captures the reverse solution process. This reveals a new source of privileged information for self-distillation: self-play can itself provide high-quality privileged context for the teacher model in a low-cost and scalable manner, without relying on human feedback or curated privileged information. Leveraging this insight, we propose Privileged Information Self-Play (π-Play), a multi-agent self-evolution framework. In π-Play, an examiner generates tasks together with their QCPs, and a teacher model leverages the QCP as privileged context to densely supervise a student via self-distillation. This design transforms conventional sparse-reward self-play into a dense-feedback self-evolution loop. Extensive experiments show that data-free π-Play surpasses fully supervised search agents and improves evolutionary efficiency by 2–3× over conventional self-play.
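The examiner–teacher–student loop described in the abstract can be sketched as a toy three-role pipeline. Everything here is an assumption made for illustration: the function names, the string-based "facts," and the keyword-overlap scoring stand in for the paper's actual models and alignment mechanism; the only point preserved is that the QCP falls out of task generation for free and then serves as the teacher's privileged context.

```python
from dataclasses import dataclass

@dataclass
class TaskWithQCP:
    question: str
    qcp: list[str]  # question construction path: the reverse solution trace

def examiner_generate(seed_fact: str) -> TaskWithQCP:
    """Toy examiner: derives a question from a known fact, recording each
    construction step. The recorded steps ARE the QCP -- a byproduct of
    task generation obtained at no extra cost."""
    answer = seed_fact.split()[0]
    qcp = [
        f"anchor fact: {seed_fact}",
        f"hide entity: {answer}",
    ]
    question = f"Which entity satisfies: {seed_fact.split(' ', 1)[1]}?"
    return TaskWithQCP(question=question, qcp=qcp)

def teacher_dense_feedback(task: TaskWithQCP, student_steps: list[str]) -> list[float]:
    """Toy teacher: with the QCP as privileged context, it scores every
    student step (here via naive keyword overlap), replacing a single
    sparse outcome reward with per-step supervision for distillation."""
    return [
        1.0 if any(word in hint for hint in task.qcp for word in step.split())
        else 0.0
        for step in student_steps
    ]

# One iteration of the loop: examiner -> student rollout -> teacher feedback.
task = examiner_generate("Paris is the capital of France")
student_steps = ["search capital France", "guess randomly", "answer Paris"]
rewards = teacher_dense_feedback(task, student_steps)
print(task.question)
print(rewards)  # every step scored, not just the final outcome
```

In the full framework these per-step scores would drive a self-distillation update of the student, closing the self-evolution loop without any external data.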