LoSA: Locality Aware Sparse Attention for Block-Wise Diffusion Language Models
arXiv cs.CL / 4/15/2026
Key Points
- Block-wise diffusion language models reduce autoregressive decoding costs but remain memory-bound in attention on long contexts, and existing sparse-attention methods transfer inefficiently to them.
- The paper identifies a KV Inflation problem in naive sparse attention for DLMs, where query-specific prefix selections cause an excessive union of KV cache pages to be loaded.
- LoSA (Locality-aware Sparse Attention) exploits the observation that most tokens change little between consecutive denoising steps: it reuses cached prefix attention for stable tokens and applies sparse attention only to active ones.
- Experiments across multiple block-wise DLMs show LoSA maintains near-dense accuracy while improving efficiency, including up to +9 average accuracy points over baseline sparse attention at aggressive sparsity and up to a 4.14× attention speedup on an RTX A6000.
- The reported gains indicate that locality and temporal stability across denoising steps can be leveraged to reduce KV loading and attention compute without large quality losses.
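The stable/active split described above can be illustrated with a minimal sketch. This is not the paper's implementation (which operates on KV cache pages); it is a toy per-token version under assumed names: a drift threshold `tau` and a `losa_step` function are invented here for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def losa_step(q, k, v, prev_q, cached_out, tau=1e-3):
    """One denoising step with locality-aware reuse (illustrative sketch).

    Tokens whose queries barely moved since the previous denoising step are
    treated as stable and reuse the cached attention output; only 'active'
    tokens recompute attention. `tau` is a hypothetical drift threshold.
    """
    # Mark a token active if its query drifted more than tau since last step.
    drift = np.linalg.norm(q - prev_q, axis=-1)
    active = drift > tau

    out = cached_out.copy()                     # stable tokens: reuse cache
    if active.any():
        qa = q[active]                          # (n_active, d)
        scores = qa @ k.T / np.sqrt(k.shape[-1])
        out[active] = softmax(scores) @ v       # recompute active rows only
    return out, active
```

Because each query row attends independently, recomputing only the active rows yields exactly the dense result for those tokens, while stable tokens skip both the KV loads and the attention compute.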