SWAA: Sliding Window Attention Adaptation for Efficient and Quality Preserving Long Context Processing

arXiv cs.CL / 3/27/2026


Key Points

  • In existing Transformer-based LLMs, the quadratic complexity of self-attention makes long-context inference expensive. Sliding Window Attention (SWA) offers near-linear computation, but naively applying it causes long-context performance to collapse.
  • The paper attributes this collapse to two causes: (1) a training-inference mismatch that arises when SWA is naively applied to models pretrained with Full Attention (FA), and (2) a structural inability to reach distant information when SWA is applied in every module at all times.
  • The proposed method, Sliding Window Attention Adaptation (SWAA), is a "plug-and-play" set of recipes that avoids costly pretraining, combining four strategies including interleaving FA and SWA layers, preserving "sink" tokens, and lightweight fine-tuning.
  • Experiments show that while no single strategy suffices, synergistic combinations recover long-context performance; a performance-efficiency analysis under varying computational overheads identifies optimal configurations and characterizes the efficiency-quality trade-off.
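The sliding-window-plus-sink pattern above can be pictured as a boolean attention mask: each query sees a recent window of keys plus a few always-visible initial tokens. A minimal sketch, where `window` and `num_sink` are hypothetical parameter names for illustration, not the paper's API:

```python
import numpy as np

def swa_sink_mask(seq_len: int, window: int, num_sink: int) -> np.ndarray:
    """Causal attention mask combining a sliding window with preserved
    "sink" tokens: query i attends to keys j with i - j < window,
    plus the first num_sink keys (illustrative, not the paper's code)."""
    q = np.arange(seq_len)[:, None]  # query positions (column vector)
    k = np.arange(seq_len)[None, :]  # key positions (row vector)
    causal = k <= q                  # no attention to future tokens
    in_window = (q - k) < window     # within the sliding window
    is_sink = k < num_sink           # sink tokens stay visible everywhere
    return causal & (in_window | is_sink)
```

For example, with `seq_len=6, window=2, num_sink=1`, the last query attends only to the sink token (position 0) and the two most recent positions, which keeps per-token cost roughly linear in sequence length.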

Abstract

The quadratic complexity of self-attention in Transformer-based LLMs renders long-context inference prohibitively expensive. While Sliding Window Attention (SWA), the simplest sparse attention pattern, offers a linear-complexity alternative, it suffers from catastrophic long-context performance collapse, which stems from two fundamental factors: the training-inference mismatch when naively applying SWA to models pretrained with Full Attention (FA), and the inherent structural inability to access distant information when applying SWA to every module at all times. To address these dual challenges, we propose Sliding Window Attention Adaptation (SWAA), a plug-and-play toolkit of recipes that adapts FA models to SWA without costly pretraining. SWAA systematically combines four core strategies to tackle these distinct issues: (1) Full Attention (FA) Decode and (2) Interleaving FA and SWA layers, which mitigate structural defects by selectively allowing access to distant information; alongside (3) preserving "sink" tokens and (4) lightweight fine-tuning, which mitigate the training-inference mismatch. Our experiments reveal that while isolated strategies are insufficient, specific synergistic combinations effectively recover long-context performance. Despite varying computational overheads, our performance-efficiency trade-off analysis identifies optimal SWAA configurations for diverse scenarios, achieving 30% to 100% speedups for long-context inference with acceptable quality retention. Our code, data and model weights are available at https://github.com/yuyijiong/sliding-window-attention-adaptation
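One of the structural recipes, interleaving FA and SWA layers, amounts to assigning an attention type to each layer so that occasional full-attention layers restore access to distant context. A minimal sketch assuming a simple every-N-th-layer schedule (`fa_every` is a hypothetical parameter; the paper's actual interleaving ratio may differ):

```python
def interleave_pattern(num_layers: int, fa_every: int = 4) -> list[str]:
    """Assign each Transformer layer an attention type: one Full
    Attention (FA) layer every `fa_every` layers, Sliding Window
    Attention (SWA) elsewhere (illustrative schedule only)."""
    return ["FA" if i % fa_every == 0 else "SWA" for i in range(num_layers)]
```

With `fa_every=4` on an 8-layer model this yields `["FA", "SWA", "SWA", "SWA", "FA", "SWA", "SWA", "SWA"]`, so 3/4 of the layers run at near-linear cost while the periodic FA layers propagate long-range information through the stack.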