Bootstrapping Post-training Signals for Open-ended Tasks via Rubric-based Self-play on Pre-training Text

arXiv cs.CL / 4/23/2026

📰 News · Models & Research

Key Points

  • The paper proposes POP, a self-play post-training framework that extends LLM self-play beyond verifiable tasks like math and coding to open-ended tasks.
  • POP uses the same LLM to generate task-specific evaluation rubrics and corresponding input-output examples, then applies the rubric to score model outputs for training.
  • To make the self-play signal more reliable, the method grounds generation in a content-rich pretraining corpus, which preserves a generation-verification gap (making outputs harder to reward-hack) and prevents mode collapse.
  • Experiments on Qwen-2.5-7B show POP improves performance for both pretrained and instruction-tuned variants across diverse domains, including long-form Healthcare QA and creative writing.
  • The approach targets the data bottleneck in post-training by reducing reliance on costly human-written input-output pairs or expensive proprietary labeling and reward models.

Abstract

Self-play has recently emerged as a promising paradigm for training Large Language Models (LLMs). In self-play, the target LLM creates the task input (e.g., asking a question), which it then addresses itself by producing a task output (e.g., giving an answer). A reward model evaluates the output, and the rewards are then used to train the LLM, typically via Reinforcement Learning (RL). Self-play incurs minimal supervision costs, which is especially helpful for post-training LLMs, since post-training requires high-quality input-output pairs that traditionally have to be written by humans or expensive proprietary models. However, existing work explores self-play only for verifiable tasks such as math and coding. Instead, we seek to extend it to more realistic open-ended tasks. In particular, we propose POP, a self-play framework that uses the same LLM to synthesize evaluation rubrics, along with input-output pairs, for each example. The rubric is then used to evaluate outputs and train the model. We further ground the framework in a content-rich pretraining corpus to (1) ensure a generation-verification gap and reduce reward hacking, and (2) prevent mode collapse. On Qwen-2.5-7B, POP improves the performance of both pretrained and instruction-tuned models across tasks ranging from long-form Healthcare QA to creative writing and instruction following.
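The self-play loop described in the abstract can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the `llm` function below is a trivial stub standing in for the target model, and all prompts and helper names are assumptions chosen for readability. The key idea it demonstrates is that one model plays every role — task author, rubric author, answerer, and grader — with the resulting score usable as an RL reward.

```python
# Hypothetical sketch of one POP-style self-play step. The same `llm`
# is reused for every role; here it is a canned stub so the loop runs.

def llm(prompt: str) -> str:
    """Stub model: returns a fixed response keyed on the prompt type."""
    if "Write a question" in prompt:
        return "What are common symptoms of dehydration?"
    if "Write a grading rubric" in prompt:
        return "1. Lists at least two symptoms\n2. Mentions when to seek care"
    if "Answer the question" in prompt:
        return ("Common symptoms include thirst and dizziness. "
                "Seek care if symptoms persist.")
    return "yes\nyes"  # grading pass: one yes/no verdict per criterion

def self_play_step(passage: str) -> float:
    # 1. Ground the task in a pretraining-corpus passage.
    question = llm(f"Write a question grounded in this passage:\n{passage}")
    # 2. The same model synthesizes an evaluation rubric for the task.
    rubric = llm(f"Write a grading rubric for:\n{question}")
    # 3. The same model produces an answer.
    answer = llm(f"Answer the question:\n{question}")
    # 4. The same model grades the answer criterion by criterion.
    verdicts = llm(
        f"Rubric:\n{rubric}\nAnswer:\n{answer}\n"
        "For each criterion, reply yes or no, one per line."
    ).splitlines()
    n_criteria = len(rubric.splitlines())
    # Fraction of satisfied criteria becomes the scalar reward that
    # would drive an RL update (e.g., a policy-gradient step).
    return sum(v.strip().lower() == "yes" for v in verdicts) / max(n_criteria, 1)

reward = self_play_step("Dehydration occurs when the body loses more fluid...")
print(reward)  # → 1.0 with this stub (both rubric criteria judged "yes")
```

In the real method the reward would feed an RL optimizer, and grounding each step in a different corpus passage is what keeps the generated tasks diverse (avoiding mode collapse) while the rubric gives the grader an easier job than the generator's (the generation-verification gap).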