ProxyAttn: Guided Sparse Attention via Representative Heads
arXiv cs.CL / 4/1/2026
Key Points
- The paper addresses the quadratic compute cost of attention in LLMs on long-text tasks by targeting more efficient sparse attention.
- It introduces ProxyAttn, a training-free method that improves block importance estimation by exploiting similarity across attention heads and using pooled representative heads as proxies.
- ProxyAttn also adds a block-aware dynamic budget estimation to handle differing sparsity needs among heads, aiming for finer-grained sparsity decisions at low overhead (see the sketch after this list).
- Experiments across multiple mainstream models and benchmarks report up to 10.3× attention acceleration and 2.4× prefilling acceleration with minimal performance loss compared with existing approaches.
- The authors release code publicly, positioning ProxyAttn as an immediately usable research contribution for accelerating long-context LLM workloads.
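The digest describes two mechanisms: scoring key/value blocks with a pooled "representative" head shared by similar heads, and choosing how many blocks to keep per head dynamically. The sketch below is not the authors' implementation; the head grouping, the last-query probe, and the cumulative-score threshold are illustrative assumptions.

```python
# Minimal sketch of proxy-head block scoring plus a dynamic block budget.
# Shapes, grouping, and thresholds are assumptions for illustration only.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def proxy_block_scores(q, k, head_groups, block_size=64):
    """Estimate per-block importance with one pooled proxy head per group.

    q, k: [num_heads, seq_len, head_dim]
    head_groups: lists of head indices assumed to attend similarly
    returns: [num_groups, num_blocks] block importance scores
    """
    num_heads, seq_len, head_dim = k.shape
    num_blocks = seq_len // block_size
    # Mean-pool keys within each block: [num_heads, num_blocks, head_dim]
    k_blocks = k[:, :num_blocks * block_size].reshape(
        num_heads, num_blocks, block_size, head_dim).mean(axis=2)
    scores = []
    for group in head_groups:
        # Pool queries/keys of similar heads into one representative (proxy) head.
        proxy_q = q[group].mean(axis=0)            # [seq_len, head_dim]
        proxy_k = k_blocks[group].mean(axis=0)     # [num_blocks, head_dim]
        # Probe with the last query position (a simplification for brevity).
        logits = proxy_q[-1] @ proxy_k.T / np.sqrt(head_dim)
        scores.append(softmax(logits))
    return np.stack(scores)                        # [num_groups, num_blocks]

def dynamic_budget(block_scores, coverage=0.95):
    """Keep the smallest set of blocks whose cumulative score reaches `coverage`."""
    selected = []
    for s in block_scores:
        order = np.argsort(-s)
        cum = np.cumsum(s[order])
        budget = int(np.searchsorted(cum, coverage)) + 1
        selected.append(np.sort(order[:budget]))
    return selected

# Toy usage: 8 heads in 2 proxy groups over a 1024-token sequence.
rng = np.random.default_rng(0)
q = rng.standard_normal((8, 1024, 64))
k = rng.standard_normal((8, 1024, 64))
scores = proxy_block_scores(q, k, head_groups=[[0, 1, 2, 3], [4, 5, 6, 7]])
print([len(b) for b in dynamic_budget(scores)])    # blocks kept per group
```

Selecting blocks by cumulative score rather than a fixed top-k is one way to give heads with diffuse attention a larger budget and heads with peaked attention a smaller one, which is the behavior the block-aware dynamic budget is aiming for.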