MAC-Attention: a Match-Amend-Complete Scheme for Fast and Accurate Attention Computation
arXiv cs.AI, 2026-04-02
Key Points
- The paper introduces MAC-Attention, a fidelity-preserving scheme that speeds up long-context LLM decoding by reusing prior attention computations when the current query is semantically similar to a recent one.
- MAC-Attention operates in three stages—match (pre-RoPE L2 matching in a local window), amend (recompute a small band near the match boundary), and complete (numerically stable merge with fresh attention on the KV tail).
- On “match hits,” its compute and bandwidth complexity is constant with respect to context length, aiming to address the IO-bound nature of long-context KV-cache reads.
- Experiments on LongBench v2 (120K), RULER (120K), and LongGenBench (16K) report up to 99% fewer KV accesses, over 60% lower token-generation latency at 128K context, and 14.3×+ attention-phase speedups over FlashInfer, while matching full-attention quality.
- The method is model-agnostic and designed to work with IO-aware kernels, paged-KV managers, and MQA/GQA, with code released on GitHub.
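The three-stage loop described above can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's implementation: the cache layout, the L2 threshold `tau`, the band width, the window size, and all function names here are assumptions. The numerically stable merge combines partial softmax-attention results via their log-sum-exp statistics, in the style of FlashAttention/FlashInfer partial-state merging.

```python
import numpy as np

def partial_attn(q, K, V):
    """Softmax attention over one KV slice; returns (output, log-sum-exp)
    so partial results can later be merged without renormalization error."""
    s = K @ q
    m = s.max()
    w = np.exp(s - m)
    return (w @ V) / w.sum(), m + np.log(w.sum())

def merge(parts):
    """Numerically stable merge of (output, lse) partials: weight each
    partial by exp(lse - max_lse) and renormalize."""
    m = max(lse for _, lse in parts)
    ws = [np.exp(lse - m) for _, lse in parts]
    return sum(w * o for w, (o, _) in zip(ws, parts)) / sum(ws)

def mac_step(q_pre, q, K, V, cache, window=8, band=4, tau=0.5):
    """One decode step of the match-amend-complete idea (illustrative only).

    q_pre -- pre-RoPE query used for L2 matching
    q     -- position-encoded query used for actual attention
    cache -- list of (q_pre, prefix_partial, covered_len) from recent steps
    """
    n = len(K)
    # MATCH: nearest pre-RoPE query within a local window of recent steps.
    for qp, prefix, cov in cache[-window:]:
        if np.linalg.norm(q_pre - qp) < tau:
            # AMEND: recompute a small band at the match boundary with the fresh query.
            amend = partial_attn(q, K[cov - band:cov], V[cov - band:cov])
            parts = [prefix, amend]
            if cov < n:
                # COMPLETE: fresh attention over the KV tail, then stable merge.
                parts.append(partial_attn(q, K[cov:], V[cov:]))
            return merge(parts)
    # Miss: full attention; cache the prefix partial (band excluded so it
    # can be amended on later hits).
    prefix = partial_attn(q, K[:n - band], V[:n - band])
    band_part = partial_attn(q, K[n - band:], V[n - band:])
    cache.append((q_pre, prefix, n))
    return merge([prefix, band_part])
```

On a hit, the cost is dominated by the band and tail slices, independent of how long the reused prefix is, which is where the constant-in-context-length behavior on match hits would come from.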