Short Data, Long Context: Distilling Positional Knowledge in Transformers
arXiv cs.CL / 4/8/2026
Key Points
- The paper argues that models can acquire long-context retrieval abilities via logit-based knowledge distillation, without expensive long-context pre-training, even when students are trained only on packed short-context samples (a minimal distillation-loss and packing sketch follows this list).
- It shows that phase-wise RoPE scaling, which maximizes usage of the rotational spectrum at each training stage, yields the strongest long-context performance in distillation setups (see the RoPE scaling sketch after this list).
- The authors demonstrate that positional information can be transferred directly through logit-based distillation, with positional perturbations propagating from query/key vectors through transformer layers to the teacher’s output distribution.
- Their experiments using packed repeated token sequences trace how positional effects systematically shape the distillation signal and identify structured update patterns in query states during long-context extension.
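For readers unfamiliar with the setup, the sketch below illustrates the two ingredients named above: a logit-based distillation loss (KL divergence between temperature-softened teacher and student output distributions) and naive packing of short tokenized samples into fixed-length sequences. This is not the paper's code; the temperature value, packing length, and function names are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the paper's implementation) of
# logit-based knowledge distillation on packed short-context data.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student distributions."""
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

def pack_short_samples(token_id_lists, max_len):
    """Concatenate short tokenized samples into fixed-length packed sequences
    (any trailing remainder shorter than max_len is dropped)."""
    flat = [tok for ids in token_id_lists for tok in ids]
    return [flat[i:i + max_len] for i in range(0, len(flat) - max_len + 1, max_len)]

# Usage (assumed model interfaces): logits have shape (batch, seq_len, vocab)
# loss = distillation_loss(student(input_ids).logits, teacher(input_ids).logits.detach())
```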
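The second sketch shows rotary position embeddings (RoPE) with a simple position-scaling knob, the kind of frequency adjustment that phase-wise scaling schedules build on. The paper's per-phase schedule is not reproduced here; the `scale` parameter, the base frequency, and the half-split rotation convention are assumptions chosen for illustration.

```python
# Minimal sketch of RoPE with a position-scaling factor (illustrative only).
import torch

def rope_angles(seq_len, head_dim, base=10000.0, scale=1.0):
    """Angles theta[p, i] = (p / scale) * base^(-2i / head_dim) for each position p and pair i."""
    inv_freq = base ** (-torch.arange(0, head_dim, 2).float() / head_dim)
    positions = torch.arange(seq_len).float() / scale   # position interpolation via `scale`
    return torch.outer(positions, inv_freq)             # shape (seq_len, head_dim // 2)

def apply_rope(x, angles):
    """Rotate query/key vectors x of shape (..., seq_len, head_dim) by the given angles,
    using the half-split (GPT-NeoX style) pairing convention."""
    x1, x2 = x.chunk(2, dim=-1)
    cos, sin = angles.cos(), angles.sin()
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```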