LinearARD: Linear-Memory Attention Distillation for RoPE Restoration

arXiv cs.AI / 4/2/2026


Key Points

  • The paper introduces LinearARD, a self-distillation approach to restore the original model behavior when extending LLM context windows via RoPE scaling and continual pre-training (which often degrades short-context performance).
  • Instead of distilling hidden states, LinearARD supervises attention dynamics by aligning row-wise distributions of self-relation matrices for Q/Q, K/K, and V/V using a frozen native-RoPE teacher for attention-structure consistency.
  • To avoid the quadratic memory cost of n×n attention relation maps, the method uses a linear-memory kernel based on per-token log-sum-exp statistics and fuses logit recomputation into the backward pass to compute exact KL divergence and gradients.
  • Experiments on LLaMA2-7B extended from 4K to 32K show LinearARD recovers 98.3% of the short-text performance of state-of-the-art baselines while surpassing them on long-context benchmarks.
  • The method is reported to reach these gains with only 4.25M training tokens, substantially fewer than the 256M required by prior approaches such as LongReD and CPT, and the authors provide code on GitHub.
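To make the supervision signal concrete, the sketch below computes a row-wise KL loss between teacher and student self-relation maps for a single head, as a naive O(n²)-memory reference (exactly the cost the paper's kernel is designed to avoid). The function name, scaling choice, and KL direction are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def relation_kl(teacher_x, student_x):
    """Row-wise KL between softmax-normalized self-relation maps.

    teacher_x, student_x: (n, d) per-head token features (e.g. Q, K, or V).
    Naive reference that materializes the full (n, n) relation maps;
    scaling and KL direction are hypothetical choices.
    """
    scale = teacher_x.shape[-1] ** -0.5
    t_logits = (teacher_x @ teacher_x.T) * scale   # (n, n) teacher self-relations
    s_logits = (student_x @ student_x.T) * scale   # (n, n) student self-relations
    t_logp = F.log_softmax(t_logits, dim=-1)
    s_logp = F.log_softmax(s_logits, dim=-1)
    # KL(teacher || student) per row, averaged over tokens
    return (t_logp.exp() * (t_logp - s_logp)).sum(-1).mean()
```

A full loss in this spirit would sum three such terms, one each for the Q/Q, K/K, and V/V relation maps, with the teacher side produced by the frozen native-RoPE model.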

Abstract

The extension of context windows in Large Language Models is typically facilitated by scaling positional encodings followed by lightweight Continual Pre-Training (CPT). While effective for processing long sequences, this paradigm often disrupts original model capabilities, leading to performance degradation on standard short-text benchmarks. We propose LinearARD, a self-distillation method that restores Rotary Position Embeddings (RoPE)-scaled students through attention-structure consistency with a frozen native-RoPE teacher. Rather than matching opaque hidden states, LinearARD aligns the row-wise distributions of dense Q/Q, K/K, and V/V self-relation matrices to directly supervise attention dynamics. To overcome the quadratic memory bottleneck of n×n relation maps, we introduce a linear-memory kernel. This kernel leverages per-token log-sum-exp statistics and fuses logit recomputation into the backward pass to compute exact Kullback-Leibler divergence and gradients. On LLaMA2-7B extended from 4K to 32K, LinearARD recovers 98.3% of the short-text performance of state-of-the-art baselines while surpassing them on long-context benchmarks. Notably, our method achieves these results using only 4.25M training tokens, compared to the 256M tokens required by LongReD and CPT. Our code is available at https://github.com/gracefulning/LinearARD.
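The per-token log-sum-exp trick the abstract mentions rests on a standard identity: for p = softmax(a) and q = softmax(b) over a row of logits, KL(p‖q) = Σᵢ pᵢ(aᵢ − bᵢ) − LSE(a) + LSE(b). So only per-row LSE statistics plus chunked recomputation of logits are needed, never a full n×n map in memory at once. The sketch below illustrates this with a two-pass chunked loop; the function name, chunk size, and pass structure are assumptions (the actual kernel additionally fuses the logit recomputation into the backward pass for exact gradients).

```python
import torch

def chunked_relation_kl(t_x, s_x, chunk=128):
    """Row-wise KL between softmax(t_x t_x^T) and softmax(s_x s_x^T)
    without materializing the full (n, n) logit matrices.

    Illustrative sketch: peak memory is O(n * chunk), linear in n
    for fixed chunk size.
    """
    n, d = t_x.shape
    scale = d ** -0.5
    t_lse = torch.full((n,), float('-inf'))
    s_lse = torch.full((n,), float('-inf'))
    # Pass 1: accumulate per-row log-sum-exp over column chunks
    for j in range(0, n, chunk):
        t_blk = (t_x @ t_x[j:j + chunk].T) * scale   # (n, <=chunk) logits
        s_blk = (s_x @ s_x[j:j + chunk].T) * scale
        t_lse = torch.logaddexp(t_lse, t_blk.logsumexp(-1))
        s_lse = torch.logaddexp(s_lse, s_blk.logsumexp(-1))
    # Pass 2: KL(p_t || p_s) = sum_i p_i (a_i - b_i) - LSE(a) + LSE(b)
    kl = s_lse - t_lse
    for j in range(0, n, chunk):
        t_blk = (t_x @ t_x[j:j + chunk].T) * scale   # recompute logits
        s_blk = (s_x @ s_x[j:j + chunk].T) * scale
        p_t = (t_blk - t_lse[:, None]).exp()         # exact teacher probs
        kl = kl + (p_t * (t_blk - s_blk)).sum(-1)
    return kl.mean()
```

Because each chunk of logits is recomputed from the (n, d) features on demand, the result is exact (identical to the naive computation up to floating-point error) while storing only n-length statistics and one chunk of logits at a time.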
