Shuffle the Context: RoPE-Perturbed Self-Distillation for Long-Context Adaptation

arXiv cs.CL / April 17, 2026


Key Points

  • The paper argues that fine-tuning short-context pretrained LLMs for long-context use can be brittle because accuracy is highly sensitive to the absolute position of relevant evidence.
  • It introduces RoPE-Perturbed Self-Distillation, which creates multiple “views” of the same input by perturbing RoPE indices and uses self-distillation to enforce consistent predictions across those views.
  • The regularizer is designed to reduce dependence on brittle positional cues and instead encourage reliance on semantic signals.
  • Experiments on long-context adaptation of Llama-3-8B and Qwen-3-4B show consistent gains on long-context benchmarks, including up to a 12.04% improvement on RULER-64K for Llama-3-8B and 2.71% on RULER-256K for Qwen-3-4B.
  • The approach also improves length extrapolation performance beyond the original training context window.
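To make the "perturbed views" idea concrete, here is a minimal sketch of how alternative RoPE index sequences could be constructed. The paper does not publish this routine; the function name, segment count, and random-gap scheme below are illustrative assumptions. The key invariant is that positions stay strictly increasing (token order is preserved) while segments of the context are effectively relocated to different absolute positions.

```python
import torch

def perturb_position_ids(seq_len: int, max_positions: int,
                         num_segments: int = 4) -> torch.Tensor:
    """Hypothetical sketch: build an alternative RoPE index sequence by
    splitting the context into segments and inserting random gaps before
    each segment, so the same tokens land at different absolute positions.
    Indices remain strictly increasing, so token order is unchanged."""
    # Random segment boundaries inside [0, seq_len].
    bounds = sorted(torch.randint(1, seq_len, (num_segments - 1,)).tolist())
    bounds = [0] + bounds + [seq_len]
    slack = max_positions - seq_len        # room to shift segments rightward
    per_seg = max(slack // num_segments, 1)
    pieces, offset = [], 0
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        # Cumulative random gap before this segment keeps indices monotone.
        offset += int(torch.randint(0, per_seg + 1, (1,)))
        pieces.append(torch.arange(lo, hi) + offset)
    return torch.cat(pieces)
```

These perturbed indices would then be fed to the model in place of the usual `0..seq_len-1` position ids when computing the auxiliary view's forward pass.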

Abstract

Large language models (LLMs) increasingly operate in settings that require reliable long-context understanding, such as retrieval-augmented generation and multi-document reasoning. A common strategy is to fine-tune pretrained short-context models at the target sequence length. However, we find that standard long-context adaptation can remain brittle: model accuracy depends strongly on the absolute placement of relevant evidence, exhibiting high positional variance even when controlling for task format and difficulty. We propose RoPE-Perturbed Self-Distillation, a training regularizer that improves positional robustness. The core idea is to form alternative "views" of the same training sequence by perturbing its RoPE indices -- effectively moving parts of the context to different positions -- and to train the model to produce consistent predictions across views via self-distillation. This encourages reliance on semantic signals instead of brittle position dependencies. Experiments on long-context adaptation of Llama-3-8B and Qwen-3-4B demonstrate consistent gains on long-context benchmarks, including up to 12.04% improvement on RULER-64K for Llama-3-8B and 2.71% on RULER-256K for Qwen-3-4B after SFT, alongside improved length extrapolation beyond the training context window.
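The self-distillation component described above can be sketched as a consistency loss between the two views. The paper does not specify the exact objective; the following assumes a common formulation in which the unperturbed view serves as a stop-gradient teacher and the RoPE-perturbed view is trained to match it via a temperature-scaled KL divergence. Function names and the choice of KL direction are assumptions.

```python
import torch
import torch.nn.functional as F

def view_consistency_loss(logits_orig: torch.Tensor,
                          logits_perturbed: torch.Tensor,
                          temperature: float = 1.0) -> torch.Tensor:
    """Hypothetical self-distillation term: KL(teacher || student), where the
    unperturbed view acts as the teacher (detached, no gradient) and the
    RoPE-perturbed view is the student. Shapes: (batch, seq_len, vocab)."""
    teacher = F.softmax(logits_orig.detach() / temperature, dim=-1)
    student = F.log_softmax(logits_perturbed / temperature, dim=-1)
    # batchmean reduction; T^2 rescaling is the standard distillation correction.
    return F.kl_div(student, teacher, reduction="batchmean") * temperature ** 2
```

In training, this regularizer would be added to the ordinary next-token cross-entropy on the original view, so the model is penalized whenever moving the same evidence to different positions changes its predictions.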