LoSA: Locality Aware Sparse Attention for Block-Wise Diffusion Language Models

arXiv cs.CL / 4/15/2026


Key Points

  • Block-wise diffusion language models reduce autoregressive decoding costs but remain bottlenecked by memory-bound attention on long contexts, and naive sparse attention does not fix this.
  • The paper identifies a KV Inflation problem in naive sparse attention for DLMs, where query-specific prefix selections cause an excessive union of KV cache pages to be loaded.
  • LoSA (Locality-aware Sparse Attention) exploits the observation that most tokens change little between consecutive denoising steps: it reuses cached prefix-attention results for stable tokens and applies sparse attention only to active ones.
  • Experiments across multiple block-wise DLMs show LoSA maintains near-dense accuracy while improving efficiency, including up to +9 average accuracy points at aggressive sparsity and up to 4.14× attention speedup on an RTX A6000 GPU.
  • The reported gains indicate that locality and temporal stability across denoising steps can be leveraged to reduce KV loading and attention compute without large quality losses.

Abstract

Block-wise diffusion language models (DLMs) generate multiple tokens in any order, offering a promising alternative to the autoregressive decoding pipeline. However, they remain bottlenecked by memory-bound attention in long-context scenarios. Naive sparse attention fails on DLMs due to a KV Inflation problem: different queries select different prefix positions, so the union of accessed KV pages grows large. To address this, we observe that between consecutive denoising steps, only a small fraction of active tokens exhibit significant hidden-state changes, while the majority of stable tokens remain nearly constant. Based on this insight, we propose LoSA (Locality-aware Sparse Attention), which reuses cached prefix-attention results for stable tokens and applies sparse attention only to active tokens. This substantially shrinks the number of KV indices that must be loaded, yielding both higher speedup and higher accuracy. Across multiple block-wise DLMs and benchmarks, LoSA preserves near-dense accuracy while significantly improving efficiency, achieving up to +9 points in average accuracy at aggressive sparsity levels while maintaining 1.54× lower attention density. It also achieves up to 4.14× attention speedup on RTX A6000 GPUs, demonstrating the effectiveness of the proposed method.
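The core reuse idea can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name `losa_step`, the threshold `tau`, and the relative-change criterion are all assumptions made here for clarity. The sketch classifies tokens as stable or active by how much their hidden state moved between two consecutive denoising steps, reuses cached prefix-attention outputs for stable tokens, and recomputes only the active ones.

```python
import numpy as np

def losa_step(h_prev, h_curr, cached_out, attend_fn, tau=0.1):
    """Illustrative sketch (names and threshold are assumptions, not the paper's code).

    h_prev, h_curr : (n_tokens, d) hidden states at consecutive denoising steps
    cached_out     : (n_tokens, d) prefix-attention outputs cached from the previous step
    attend_fn      : computes attention outputs for a subset of query states
                     (in LoSA this would itself be sparse attention)
    tau            : relative-change threshold separating stable from active tokens
    """
    # Per-token relative hidden-state change between the two denoising steps.
    delta = np.linalg.norm(h_curr - h_prev, axis=-1) / (
        np.linalg.norm(h_prev, axis=-1) + 1e-6
    )
    active = delta > tau              # tokens whose state changed significantly
    out = cached_out.copy()           # stable tokens: reuse cached results
    if active.any():
        # Active tokens: recompute attention for these queries only,
        # so far fewer KV indices need to be loaded.
        out[active] = attend_fn(h_curr[active])
    return out, active
```

In a real DLM the active fraction is small at most steps, so the expensive `attend_fn` call touches only a few query rows, which is where the KV-loading savings come from.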