PAINT: Partial-Solution Adaptive Interpolated Training for Self-Distilled Reasoners

arXiv cs.LG / 4/30/2026

Key Points

  • The paper introduces PAINT (Partial-solution Adaptive INterpolated Training), which aims to improve LLM reasoning with supervision that is both aligned with the model’s own test-time states and informative at the token level.
  • PAINT reframes privileged on-policy self-distillation as contextual re-scoring, asking how much of the verified solution context to reveal and where that context’s distribution should shape the student.
  • It masks the verified solution according to rollout-reference overlap and applies a small energy-space interpolation on a sparse set of token positions where the entropies mismatch.
  • Experiments on competition-level math benchmarks show PAINT improves over a strong on-policy self-distillation baseline across three Qwen3 model scales.
  • On Qwen3-8B, PAINT raises macro Avg@12 by 2.1 points over that self-distillation baseline and by 2.9 points over GRPO.
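
For readers unfamiliar with the metric, here is a minimal sketch of how a macro Avg@k score could be computed. It assumes Avg@k is the mean correctness over k sampled answers per problem, and that "macro" means an unweighted mean over benchmarks. The summary does not define the metric, so the definition, the function names, and the toy data below are all illustrative assumptions.

```python
def avg_at_k(samples_per_problem):
    """Mean correctness over k samples, averaged over problems.

    samples_per_problem: list of lists of 0/1 flags, one inner list
    per problem (assumed layout, not taken from the paper).
    """
    per_problem = [sum(flags) / len(flags) for flags in samples_per_problem]
    return sum(per_problem) / len(per_problem)


def macro_avg_at_k(results_by_benchmark):
    """Unweighted mean of per-benchmark Avg@k scores (assumed meaning of 'macro')."""
    scores = [avg_at_k(flags) for flags in results_by_benchmark.values()]
    return sum(scores) / len(scores)


# Toy data with k = 2 samples per problem; benchmark names are hypothetical.
toy_results = {
    "benchmark_a": [[1, 0], [1, 1]],  # two problems
    "benchmark_b": [[0, 0]],          # one problem
}
toy_score = macro_avg_at_k(toy_results)
```

Under this reading, each benchmark contributes equally to the macro score regardless of how many problems it contains.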

Abstract

Improving large language model (LLM) reasoning requires supervision that is both aligned with the model's own test-time states and informative at the token level. Reinforcement learning with verifiable rewards provides on-policy exploration but offers sparse, high-variance credit; supervised fine-tuning and distillation provide dense targets but often train on fixed trajectories or rely on stronger teachers. Recent privileged on-policy self-distillation explores a middle ground by scoring student rollouts with the same model under verified solution context. We revisit this setting through a contextual re-scoring lens: for reasoning, the important choices are not only whether privileged context is available, but how much of it should be revealed and where its distribution should shape the student. We propose PAINT (Partial-solution Adaptive INterpolated Training), which masks the verified solution according to rollout-reference overlap and applies a small energy-space interpolation on a sparse set of entropy-mismatch token positions. Across competition-level math benchmarks, PAINT consistently improves over a strong prior on-policy self-distillation baseline at all three Qwen3 scales. On Qwen3-8B, it raises macro Avg@12 by 2.1 points over this prior baseline and 2.9 points over GRPO.
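
To make the interpolation step more concrete, the sketch below shows one plausible reading of it, not the paper's implementation. It assumes that "energy space" means logit space (so interpolating energies gives a normalized geometric mixture of the two distributions), that entropy mismatch is the absolute entropy gap between the student's distribution and the same model's distribution under solution context, and that the sparse set is the top-k positions by that gap. The function names, `alpha`, and `top_k` are hypothetical. The masking rule is not sketched, since the abstract does not specify it.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def interpolate_energy(student_logits, context_logits, alpha):
    """Interpolate in logit space, then normalize (assumed form of the step)."""
    mixed = [(1.0 - alpha) * s + alpha * c
             for s, c in zip(student_logits, context_logits)]
    return softmax(mixed)


def select_mismatch_positions(student_seq, context_seq, top_k):
    """Pick the top_k positions with the largest absolute entropy gap."""
    gaps = []
    for i, (s, c) in enumerate(zip(student_seq, context_seq)):
        gap = abs(entropy(softmax(s)) - entropy(softmax(c)))
        gaps.append((gap, i))
    gaps.sort(reverse=True)
    return sorted(i for _, i in gaps[:top_k])


def interpolated_targets(student_seq, context_seq, alpha=0.1, top_k=1):
    """Return {position: target distribution} for the selected positions only."""
    chosen = select_mismatch_positions(student_seq, context_seq, top_k)
    return {i: interpolate_energy(student_seq[i], context_seq[i], alpha)
            for i in chosen}


# Toy two-token vocabulary, two positions; all values are made up.
student_seq = [[0.0, 0.0], [2.0, 0.0]]  # uncertain at position 0
context_seq = [[5.0, 0.0], [2.0, 0.0]]  # confident at position 0
targets = interpolated_targets(student_seq, context_seq, alpha=0.5, top_k=1)
```

In the toy data only position 0 has an entropy gap, so it is the only position that receives an interpolated target. A small `alpha` keeps that target close to the student's own distribution.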