Peer-Predictive Self-Training for Language Model Reasoning

arXiv cs.AI / April 16, 2026


Key Points

  • The paper proposes Peer-Predictive Self-Training (PST), a label-free self-improvement method where multiple language models collaborate using a cross-model aggregated answer as an internal training target.
  • During sequential response generation, PST quantifies how informative each intermediate response is about the final aggregate using pointwise mutual information (PMI), and scales fine-tuning updates accordingly.
  • The method updates models less when their responses are already aligned with the aggregate and more when responses are less informative or misaligned, aiming to sharpen reasoning consistency.
  • Experiments on mathematical reasoning benchmarks (SimulEq, Math500, MultiArith) show exact-match accuracy gains of 2.2–4.3 percentage points across Gemma-2-2B, LLaMA-3.2-1B, and Qwen-2.5-1.5B.
  • PST also reduces the generator–verifier gap (GV-Gap) by 26–40% and requires no external supervision, indicating cross-model peer feedback can be an effective self-supervised training approach.
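The PMI signal named in the bullets above can be made concrete with a small sketch. The paper defines informativeness via pointwise mutual information between an intermediate response and the final aggregated answer; the function and toy probabilities below are illustrative assumptions, not values from the paper.

```python
import math

def response_pmi(p_aggregate_given_response: float, p_aggregate: float) -> float:
    """PMI(response; aggregate) = log p(agg | response) - log p(agg).
    Positive PMI: the response makes the aggregate answer more likely,
    i.e., it is already informative about the final answer."""
    return math.log(p_aggregate_given_response) - math.log(p_aggregate)

# Toy numbers (illustrative only): the aggregate answer has prior
# probability 0.25; conditioning on an aligned response raises it to 0.75,
# while a misaligned response lowers it to 0.10.
aligned_pmi = response_pmi(0.75, 0.25)     # log 3 > 0: informative
misaligned_pmi = response_pmi(0.10, 0.25)  # log 0.4 < 0: uninformative
```

Under PST's rule, a response with high PMI (already aligned with the aggregate) would receive a smaller fine-tuning update than one with low or negative PMI.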

Abstract

Mechanisms for continued self-improvement of language models without external supervision remain an open challenge. We propose Peer-Predictive Self-Training (PST), a label-free fine-tuning framework in which multiple language models improve collaboratively by leveraging a cross-model aggregated response as an internal training signal. Given a prompt question, the models generate responses sequentially; the final aggregated answer, often more reliable than individual responses in practice, serves as an internal target for learning. We measure how informative each intermediate response is about the aggregate using pointwise mutual information (PMI), and use this signal to scale self-training updates. Responses already aligned with the aggregate are updated less, while less informative or misaligned responses are updated more. On mathematical reasoning benchmarks (SimulEq, Math500, and MultiArith), PST improves exact-match accuracy by 2.2 to 4.3 percentage points across Gemma-2-2B, LLaMA-3.2-1B, and Qwen-2.5-1.5B, and reduces the average generator-verifier gap (GV-Gap) by 26 to 40 percent, while requiring no external supervision or teacher-student hierarchy and relying solely on cross-model interactions. These results suggest that cross-model generations and peer-predictive feedback can serve as an effective basis for self-supervised training.
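The abstract's update rule — scale each response's self-training loss by how informative it is about the aggregate — can be sketched as follows. The sigmoid mapping from PMI to a weight, and the function names, are assumptions chosen to match the stated behavior (aligned responses updated less, misaligned ones more), not the paper's exact parameterization.

```python
import math

def pmi_weight(log_p_agg_given_resp: float, log_p_agg: float) -> float:
    """Scale factor for a self-training update. PMI is the log-ratio below;
    a sigmoid of -PMI (an assumed monotone mapping) yields weights in (0, 1)
    that shrink as PMI grows, so aligned responses are updated less."""
    pmi = log_p_agg_given_resp - log_p_agg
    return 1.0 / (1.0 + math.exp(pmi))

def weighted_self_training_loss(nll_on_aggregate: list,
                                log_p_agg_given_resp: list,
                                log_p_agg: float) -> float:
    """Each response's negative log-likelihood of the aggregated answer,
    scaled by its PMI-derived weight and summed over the peer models'
    intermediate responses (a hypothetical per-round objective)."""
    return sum(pmi_weight(lp, log_p_agg) * nll
               for nll, lp in zip(nll_on_aggregate, log_p_agg_given_resp))
```

With zero PMI the weight is exactly 0.5, and a response that raises the aggregate's probability (positive PMI) receives a strictly smaller weight than one that lowers it, matching the qualitative behavior the abstract describes.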