MAC-Attention: a Match-Amend-Complete Scheme for Fast and Accurate Attention Computation
arXiv cs.AI / 4/2/2026
Key Points
- The paper introduces MAC-Attention, a fidelity- and access-preserving scheme to speed up long-context LLM decoding by reusing prior attention computations for semantically similar recent queries.
- MAC-Attention operates in three stages—match (pre-RoPE L2 matching in a local window), amend (recompute a small band near the match boundary), and complete (numerically stable merge with fresh attention on the KV tail).
- On “match hits,” its compute and bandwidth complexity is constant with respect to context length, aiming to address the IO-bound nature of long-context KV-cache reads.
- Experiments on LongBench v2 (120K), RULER (120K), and LongGenBench (16K) report up to 99% fewer KV accesses, 60%+ lower token generation latency at 128K, and 14.3x+ attention-phase speedups versus FlashInfer while maintaining full-attention quality.
- The method is model-agnostic and designed to work with IO-aware kernels, paged-KV managers, and MQA/GQA, with code released on GitHub.
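The "complete" stage in the points above hinges on a numerically stable merge of a reused attention result with fresh attention over the KV tail. A minimal sketch of that merge primitive, in the style of FlashAttention's online softmax, is below; the function names, the unscaled dot-product scores, and the way partial results are represented as (output, running max, normalizer) triples are illustrative assumptions, not the paper's actual kernel.

```python
import numpy as np

def partial_attention(q, K, V):
    """Attention for one query over a slice of the KV cache.
    Returns the partial output plus the running max and softmax
    normalizer needed to merge it with other slices later.
    (Score scaling by 1/sqrt(d) is omitted for brevity.)"""
    s = K @ q                      # scores against this KV slice
    m = s.max()                    # running max for stability
    w = np.exp(s - m)              # shifted exponentials
    l = w.sum()                    # softmax normalizer for this slice
    return (w @ V) / l, m, l

def merge(o_a, m_a, l_a, o_b, m_b, l_b):
    """Numerically stable merge of two partial attention results,
    the primitive a 'complete' stage would use to fuse a reused
    result with fresh attention on the KV tail."""
    m = max(m_a, m_b)
    la = l_a * np.exp(m_a - m)     # rescale each side to the joint max
    lb = l_b * np.exp(m_b - m)
    l = la + lb
    return (o_a * la + o_b * lb) / l, m, l
```

As a sanity check, merging the partial results of two halves of a KV cache reproduces full attention over the whole cache exactly, which is what lets the scheme combine a cached prefix result with freshly computed tail attention without quality loss.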