SCOPE: Signal-Calibrated On-Policy Distillation Enhancement with Dual-Path Adaptive Weighting

arXiv cs.LG · April 14, 2026


Key Points

  • SCOPE addresses a key limitation in On-Policy Distillation by calibrating token-level KL supervision according to the quality of on-policy signals rather than applying uniform weighting across rollouts.
  • The method splits rollouts into two paths by correctness: incorrect trajectories receive teacher-perplexity-weighted KL distillation, emphasizing cases where the teacher can reliably correct the student, while correct trajectories use student-perplexity-weighted MLE, focusing learning on borderline, low-confidence examples rather than already-mastered ones.
  • SCOPE further stabilizes learning via group-level normalization that adjusts weight distributions across prompts with varying intrinsic difficulty.
  • Experiments on six reasoning benchmarks report consistent gains, including an average relative improvement of 11.42% on Avg@32 and 7.30% on Pass@32 versus competitive baselines.
  • Overall, the paper proposes a training-time routing and adaptive weighting strategy to improve reasoning alignment under sparse, outcome-level rewards typical of on-policy RL setups.
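The dual-path routing above can be sketched in a few lines. This is an illustrative reading, not the authors' code: the field names and the exact weighting functions (inverse teacher perplexity for the KL path, student perplexity for the MLE path) are assumptions chosen to match the description.

```python
import math

def route_and_weight(trajectories):
    """Hypothetical sketch of SCOPE's dual-path weighting.

    Each trajectory dict is assumed to carry:
      - 'correct': outcome verdict for the rollout
      - 'teacher_nll' / 'student_nll': mean per-token negative log-likelihood
    Returns (path, weight, trajectory) triples.
    """
    weighted = []
    for t in trajectories:
        if t['correct']:
            # Correct path: student perplexity = exp(mean NLL).
            # Higher perplexity -> borderline sample -> larger MLE weight.
            w = math.exp(t['student_nll'])
            weighted.append(('mle', w, t))
        else:
            # Incorrect path: inverse teacher perplexity.
            # A confident teacher (low NLL) -> more reliable correction
            # -> larger KL weight; unreliable guidance is down-weighted.
            w = math.exp(-t['teacher_nll'])
            weighted.append(('kl', w, t))
    return weighted
```

Under this reading, the two paths never mix supervision signals: a rollout contributes either a weighted KL term (against the teacher) or a weighted MLE term (on its own tokens), with the weight deciding how much it matters within its path.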

Abstract

On-policy reinforcement learning has become the dominant paradigm for reasoning alignment in large language models, yet its sparse, outcome-level rewards make token-level credit assignment notoriously difficult. On-Policy Distillation (OPD) alleviates this by introducing dense, token-level KL supervision from a teacher model, but typically applies this supervision uniformly across all rollouts, ignoring fundamental differences in signal quality. We propose Signal-Calibrated On-Policy Distillation Enhancement (SCOPE), a dual-path adaptive training framework that routes on-policy rollouts by correctness into two complementary supervision paths. For incorrect trajectories, SCOPE performs teacher-perplexity-weighted KL distillation to prioritize instances where the teacher demonstrates genuine corrective capability, while down-weighting unreliable guidance. For correct trajectories, it applies student-perplexity-weighted MLE to concentrate reinforcement on low-confidence samples at the capability boundary rather than over-reinforcing already mastered ones. Both paths employ a group-level normalization to adaptively calibrate weight distributions, accounting for the intrinsic difficulty variance across prompts. Extensive experiments on six reasoning benchmarks show that SCOPE achieves an average relative improvement of 11.42% in Avg@32 and 7.30% in Pass@32 over competitive baselines, demonstrating its consistent effectiveness.
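The group-level normalization mentioned in the abstract can be illustrated with a simple per-prompt rescaling. The abstract does not specify the exact normalizer; mean-normalizing weights within each prompt's rollout group, so that hard and easy prompts contribute comparably, is one plausible instantiation and is an assumption here.

```python
from collections import defaultdict

def group_normalize(weights, group_ids):
    """Sketch of group-level weight calibration (assumed form).

    Rescales weights so that each prompt group's weights average to 1,
    preventing intrinsically hard prompts (uniformly high or low raw
    weights) from dominating or vanishing in the loss.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for w, g in zip(weights, group_ids):
        sums[g] += w
        counts[g] += 1
    return [
        w * counts[g] / sums[g] if sums[g] > 0 else 1.0
        for w, g in zip(weights, group_ids)
    ]
```

For example, raw weights `[1.0, 3.0]` and `[2.0, 2.0]` for two different prompts both normalize to mean 1 within their group, so relative emphasis is preserved inside a prompt while cross-prompt scale differences are removed.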