π-Play: Multi-Agent Self-Play via Privileged Self-Distillation without External Data

arXiv cs.LG · April 16, 2026


Key Points

  • The paper introduces π-Play, a multi-agent self-play framework that converts sparse-reward training into a dense-feedback training loop by using an intermediate artifact called a Question Construction Path (QCP).
  • π-Play uses an examiner to generate both tasks and QCPs, while a teacher model consumes the QCP as privileged context to provide dense supervision to a student through self-distillation—without any external or labeled data.
  • The key insight is that self-play naturally yields QCPs, which act as high-quality privileged information derived at low cost and scalable across tasks.
  • Experiments report that data-free π-Play surpasses fully supervised search agents and improves evolutionary efficiency by about 2–3× compared with conventional self-play.
  • The work targets core training issues for deep search agents, including sparse rewards, weak credit assignment, and limited availability of labeled data, by restructuring the supervision signal.
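The contrast between sparse outcome rewards and QCP-derived dense feedback can be illustrated with a toy reward function. This is a minimal sketch for intuition only: the function names, the exact-match scoring rule, and the example trajectory are illustrative assumptions, not the paper's implementation.

```python
def sparse_reward(trajectory: list[str], correct_answer: str) -> list[float]:
    """Conventional self-play: only the final step carries any signal,
    so credit assignment over intermediate actions is weak."""
    hit = 1.0 if trajectory and trajectory[-1] == correct_answer else 0.0
    return [0.0] * (len(trajectory) - 1) + [hit]

def dense_reward(trajectory: list[str], qcp: list[str]) -> list[float]:
    """QCP-based feedback: every step is scored against the question
    construction path, so each action receives its own signal.
    (Toy scoring rule: exact membership in the QCP.)"""
    return [1.0 if step in qcp else 0.0 for step in trajectory]

# Hypothetical search trajectory and QCP for a capital-city question.
traj = ["lookup source", "cross-check date", "Paris"]
qcp = ["lookup source", "cross-check date", "hide answer Paris"]
print(sparse_reward(traj, "Paris"))  # one terminal reward
print(dense_reward(traj, qcp))       # per-step feedback
```

With a sparse reward, a trajectory that fails at the last step yields zero signal everywhere; the dense variant still credits the correct intermediate steps.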

Abstract

Deep search agents have emerged as a promising paradigm for addressing complex information-seeking tasks, but their training remains challenging due to sparse rewards, weak credit assignment, and limited labeled data. Self-play offers a scalable route to reduce data dependence, but conventional self-play optimizes students only through sparse outcome rewards, leading to low learning efficiency. In this work, we observe that self-play naturally produces a question construction path (QCP) during task generation, an intermediate artifact that captures the reverse solution process. This reveals a new source of privileged information for self-distillation: self-play can itself provide high-quality privileged context for the teacher model in a low-cost and scalable manner, without relying on human feedback or curated privileged information. Leveraging this insight, we propose Privileged Information Self-Play (π-Play), a multi-agent self-evolution framework. In π-Play, an examiner generates tasks together with their QCPs, and a teacher model leverages the QCP as privileged context to densely supervise a student via self-distillation. This design transforms conventional sparse-reward self-play into a dense-feedback self-evolution loop. Extensive experiments show that data-free π-Play surpasses fully supervised search agents and improves evolutionary efficiency by 2–3× over conventional self-play.
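The examiner–teacher–student loop described in the abstract can be sketched as a toy three-role pipeline. Everything here is an assumption made for illustration: the function names, the string-based "facts," and the keyword-overlap scoring stand in for the paper's actual models and alignment mechanism; the only point preserved is that the QCP falls out of task generation for free and then serves as the teacher's privileged context.

```python
from dataclasses import dataclass

@dataclass
class TaskWithQCP:
    question: str
    qcp: list[str]  # question construction path: the reverse solution trace

def examiner_generate(seed_fact: str) -> TaskWithQCP:
    """Toy examiner: derives a question from a known fact, recording each
    construction step. The recorded steps ARE the QCP -- a byproduct of
    task generation obtained at no extra cost."""
    answer = seed_fact.split()[0]
    qcp = [
        f"anchor fact: {seed_fact}",
        f"hide entity: {answer}",
    ]
    question = f"Which entity satisfies: {seed_fact.split(' ', 1)[1]}?"
    return TaskWithQCP(question=question, qcp=qcp)

def teacher_dense_feedback(task: TaskWithQCP, student_steps: list[str]) -> list[float]:
    """Toy teacher: with the QCP as privileged context, it scores every
    student step (here via naive keyword overlap), replacing a single
    sparse outcome reward with per-step supervision for distillation."""
    return [
        1.0 if any(word in hint for hint in task.qcp for word in step.split())
        else 0.0
        for step in student_steps
    ]

# One iteration of the loop: examiner -> student rollout -> teacher feedback.
task = examiner_generate("Paris is the capital of France")
student_steps = ["search capital France", "guess randomly", "answer Paris"]
rewards = teacher_dense_feedback(task, student_steps)
print(task.question)
print(rewards)  # every step scored, not just the final outcome
```

In the full framework these per-step scores would drive a self-distillation update of the student, closing the self-evolution loop without any external data.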