Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models
arXiv cs.CL / 4/10/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- On-policy distillation (OPD) for large language models can suffer a “truncation collapse” failure mode where on-policy rollouts abruptly inflate in length, causing truncated trajectories to dominate training data and destabilize learning.
- The observed truncation collapse correlates with repetition saturation, which biases the gradient signal and leads to a sharp drop in validation performance (see the first sketch after this list for illustrative monitors).
- The paper attributes the issue to a harmful interaction between student-induced data collection and the distillation objective, which implicitly favors long and repetitive rollouts.
- To fix this, the authors propose StableOPD, which combines a reference-based divergence constraint with rollout mixture distillation to curb repetition-driven length inflation and stabilize training (see the second sketch after this list).
- Experiments across multiple math reasoning datasets show StableOPD prevents truncation collapse, stabilizes training dynamics, and improves performance by an average of 7.2% versus baseline OPD.
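The failure mode in the first two points is observable in training telemetry. Below is a minimal sketch, assuming access to token-id rollouts, of two hypothetical monitors: the fraction of rollouts that hit the generation length cap, and a simple n-gram repetition ratio. The function names and thresholds are illustrative assumptions, not taken from the paper.

```python
from collections import Counter

def truncation_ratio(rollouts, max_len):
    """Fraction of rollouts that hit the generation length cap.

    A sharp rise in this ratio is the "truncation collapse"
    signature described in the key points: truncated trajectories
    start to dominate the training batch.
    """
    return sum(len(r) >= max_len for r in rollouts) / max(len(rollouts), 1)

def repetition_ratio(tokens, n=4):
    """Share of duplicated n-grams in one rollout (0.0 = no repeats).

    High values indicate the repetition saturation that the paper
    correlates with length inflation.
    """
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(ngrams)
    repeated = sum(c - 1 for c in counts.values())
    return repeated / len(ngrams)

# Illustrative usage with toy token-id rollouts (the 0.5 alarm
# threshold is an assumption for demonstration only):
rollouts = [[1, 2, 3, 4] * 128, [1, 2, 3, 4] * 128, [5, 6, 7, 8]]
if truncation_ratio(rollouts, max_len=512) > 0.5:
    print("warning: truncated trajectories dominate the batch")
print(f"repetition of rollout 0: {repetition_ratio(rollouts[0]):.2f}")
```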
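The fix described in the fourth point combines two terms. Here is a minimal PyTorch sketch, assuming per-token logits from the student, the teacher, and a frozen reference model. The loss shape (a reverse-KL distillation term plus a KL penalty toward the reference, applied to a batch mixing student and teacher rollouts) follows the bullet's description, but the exact objective, the `beta` weight, and all names are assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def stable_opd_loss(student_logits, teacher_logits, ref_logits, beta=0.1):
    """Hypothetical StableOPD-style loss over one batch of positions.

    distill : reverse KL(student || teacher), the usual OPD signal.
    anchor  : KL(student || reference), the reference-based divergence
              constraint that discourages drifting into repetitive,
              length-inflated policies.
    """
    logp_s = F.log_softmax(student_logits, dim=-1)
    logp_t = F.log_softmax(teacher_logits, dim=-1)
    logp_r = F.log_softmax(ref_logits, dim=-1)
    p_s = logp_s.exp()
    distill = (p_s * (logp_s - logp_t)).sum(-1).mean()
    anchor = (p_s * (logp_s - logp_r)).sum(-1).mean()
    return distill + beta * anchor

# Rollout mixture distillation (sketch): in practice, a fraction of the
# batch would come from student on-policy rollouts and the rest from
# teacher rollouts, with the same loss applied to both slices. Random
# logits stand in for model outputs here.
vocab, seq = 32, 16
student = torch.randn(seq, vocab, requires_grad=True)
teacher = torch.randn(seq, vocab)
reference = torch.randn(seq, vocab)
loss = stable_opd_loss(student, teacher, reference, beta=0.1)
loss.backward()
print(f"loss: {loss.item():.4f}")
```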