SCOPE: Signal-Calibrated On-Policy Distillation Enhancement with Dual-Path Adaptive Weighting

arXiv cs.LG · April 14, 2026


Key Points

  • SCOPE addresses a key limitation in On-Policy Distillation by calibrating token-level KL supervision according to the quality of on-policy signals rather than applying uniform weighting across rollouts.
  • The method splits rollouts into two paths by correctness: incorrect trajectories receive teacher-perplexity-weighted KL distillation, emphasizing cases where the teacher can reliably correct the student, while correct trajectories use student-perplexity-weighted MLE, focusing learning on borderline, low-confidence examples rather than already-mastered ones.
  • SCOPE further stabilizes learning via group-level normalization that adjusts weight distributions across prompts with varying intrinsic difficulty.
  • Experiments on six reasoning benchmarks report consistent gains, including an average relative improvement of 11.42% on Avg@32 and 7.30% on Pass@32 versus competitive baselines.
  • Overall, the paper proposes a training-time routing and adaptive weighting strategy to improve reasoning alignment under sparse, outcome-level rewards typical of on-policy RL setups.
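The dual-path routing above can be sketched in a few lines. This is an illustrative reading, not the authors' code: the field names and the exact weighting functions (inverse teacher perplexity for the KL path, student perplexity for the MLE path) are assumptions chosen to match the description.

```python
import math

def route_and_weight(trajectories):
    """Hypothetical sketch of SCOPE's dual-path weighting.

    Each trajectory dict is assumed to carry:
      - 'correct': outcome verdict for the rollout
      - 'teacher_nll' / 'student_nll': mean per-token negative log-likelihood
    Returns (path, weight, trajectory) triples.
    """
    weighted = []
    for t in trajectories:
        if t['correct']:
            # Correct path: student perplexity = exp(mean NLL).
            # Higher perplexity -> borderline sample -> larger MLE weight.
            w = math.exp(t['student_nll'])
            weighted.append(('mle', w, t))
        else:
            # Incorrect path: inverse teacher perplexity.
            # A confident teacher (low NLL) -> more reliable correction
            # -> larger KL weight; unreliable guidance is down-weighted.
            w = math.exp(-t['teacher_nll'])
            weighted.append(('kl', w, t))
    return weighted
```

Under this reading, the two paths never mix supervision signals: a rollout contributes either a weighted KL term (against the teacher) or a weighted MLE term (on its own tokens), with the weight deciding how much it matters within its path.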

Abstract

On-policy reinforcement learning has become the dominant paradigm for reasoning alignment in large language models, yet its sparse, outcome-level rewards make token-level credit assignment notoriously difficult. On-Policy Distillation (OPD) alleviates this by introducing dense, token-level KL supervision from a teacher model, but typically applies this supervision uniformly across all rollouts, ignoring fundamental differences in signal quality. We propose Signal-Calibrated On-Policy Distillation Enhancement (SCOPE), a dual-path adaptive training framework that routes on-policy rollouts by correctness into two complementary supervision paths. For incorrect trajectories, SCOPE performs teacher-perplexity-weighted KL distillation to prioritize instances where the teacher demonstrates genuine corrective capability, while down-weighting unreliable guidance. For correct trajectories, it applies student-perplexity-weighted MLE to concentrate reinforcement on low-confidence samples at the capability boundary rather than over-reinforcing already mastered ones. Both paths employ a group-level normalization to adaptively calibrate weight distributions, accounting for the intrinsic difficulty variance across prompts. Extensive experiments on six reasoning benchmarks show that SCOPE achieves an average relative improvement of 11.42% in Avg@32 and 7.30% in Pass@32 over competitive baselines, demonstrating its consistent effectiveness.
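The group-level normalization mentioned in the abstract can be illustrated with a simple per-prompt rescaling. The abstract does not specify the exact normalizer; mean-normalizing weights within each prompt's rollout group, so that hard and easy prompts contribute comparably, is one plausible instantiation and is an assumption here.

```python
from collections import defaultdict

def group_normalize(weights, group_ids):
    """Sketch of group-level weight calibration (assumed form).

    Rescales weights so that each prompt group's weights average to 1,
    preventing intrinsically hard prompts (uniformly high or low raw
    weights) from dominating or vanishing in the loss.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for w, g in zip(weights, group_ids):
        sums[g] += w
        counts[g] += 1
    return [
        w * counts[g] / sums[g] if sums[g] > 0 else 1.0
        for w, g in zip(weights, group_ids)
    ]
```

For example, raw weights `[1.0, 3.0]` and `[2.0, 2.0]` for two different prompts both normalize to mean 1 within their group, so relative emphasis is preserved inside a prompt while cross-prompt scale differences are removed.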