Short Data, Long Context: Distilling Positional Knowledge in Transformers

arXiv cs.CL / 4/8/2026


Key Points

  • The paper argues that models can acquire long-context retrieval abilities via logit-based knowledge distillation without expensive long-context pre-training, even when students train only on packed short-context samples.
  • It shows that phase-wise RoPE scaling (maximizing rotational spectrum usage at each training stage) yields the strongest long-context performance in distillation setups.
  • The authors demonstrate that positional information can be transferred directly through logit-based distillation, with positional perturbations propagating from query/key vectors through transformer layers to the teacher’s output distribution.
  • Their experiments using packed repeated token sequences trace how positional effects systematically shape the distillation signal and identify structured update patterns in query states during long-context extension.
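The paper's exact training recipe is not reproduced here, but the core mechanism the key points describe, matching a student's per-token output distribution to a teacher's over a long window packed with short samples, can be sketched as a standard Hinton-style temperature-scaled KL objective. Everything below (shapes, temperature value, the `distillation_loss` helper) is an illustrative assumption, not the authors' implementation.

```python
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Token-level KL(teacher || student), averaged over every position
    of the packed window. Both inputs have shape (seq_len, vocab)."""
    t = softmax(teacher_logits / temperature)
    s = softmax(student_logits / temperature)
    kl = (t * (np.log(t + 1e-12) - np.log(s + 1e-12))).sum(axis=-1)
    # The temperature^2 factor keeps gradient magnitudes comparable
    # to a hard-label cross-entropy term (standard KD convention).
    return temperature ** 2 * kl.mean()

# Toy example: 8 packed positions, vocabulary of 16.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(8, 16))
student = rng.normal(size=(8, 16))
print(distillation_loss(teacher, student))
```

The point of the setup is that even though each packed sample is short, every position in the long window contributes a distillation target, so positional effects in the teacher's logits reach the student.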

Abstract

Extending the context window of language models typically requires expensive long-context pre-training, posing significant challenges for both training efficiency and data collection. In this paper, we present evidence that long-context retrieval capabilities can be transferred to student models through logit-based knowledge distillation, even when training exclusively on packed short-context samples within a long-context window. We provide comprehensive insights through the lens of Rotary Position Embedding (RoPE) and establish three key findings. First, consistent with prior work, we show that phase-wise RoPE scaling, which maximizes rotational spectrum utilization at each training stage, also achieves the best long-context performance in knowledge distillation setups. Second, we demonstrate that logit-based knowledge distillation can directly enable positional information transfer. Using an experimental setup with packed repeated token sequences, we trace the propagation of positional perturbations from query and key vectors through successive transformer layers to output logits, revealing that positional information systematically influences the teacher's output distribution and, in turn, the distillation signal received by the student model. Third, our analysis uncovers structured update patterns in the query state during long-context extension, with distinct parameter spans exhibiting strong sensitivity to long-context training.
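The abstract's RoPE discussion can be made concrete with a generic rotary-embedding sketch. The phase-wise scaling schedule itself is specific to the paper; below, a single position-interpolation-style `scale` factor stands in for it, compressing rotation phases so a longer window maps onto the rotational spectrum the model was trained on. Function names and the scaling scheme are illustrative assumptions.

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0, scale=1.0):
    # Per-pair rotation frequencies. Dividing positions by `scale`
    # (position interpolation) squeezes a longer context into the
    # phase range seen during training.
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return np.outer(np.asarray(positions) / scale, inv_freq)

def apply_rope(x, positions, base=10000.0, scale=1.0):
    """Rotate query/key vectors x of shape (seq_len, dim) pairwise."""
    theta = rope_angles(positions, x.shape[-1], base, scale)
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# With scale=2, position 6 lands on the same phases as position 3
# did at scale=1, i.e. the trained spectrum is reused, not exceeded.
q = np.ones((1, 8))
print(np.allclose(apply_rope(q, [6], scale=2.0),
                  apply_rope(q, [3], scale=1.0)))
```

Because RoPE rotates query/key pairs before attention, any change to these phases propagates through attention scores into the logits, which is the pathway the paper's perturbation-tracing experiments follow.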