MAC-Attention: a Match-Amend-Complete Scheme for Fast and Accurate Attention Computation
arXiv cs.AI, 2026-04-02
Key Points
- The paper introduces MAC-Attention, a fidelity-preserving scheme that speeds up long-context LLM decoding by reusing prior attention computations when the current query is semantically similar to a recent one.
- MAC-Attention operates in three stages—match (pre-RoPE L2 matching in a local window), amend (recompute a small band near the match boundary), and complete (numerically stable merge with fresh attention on the KV tail).
- On “match hits,” its compute and bandwidth complexity is constant with respect to context length, aiming to address the IO-bound nature of long-context KV-cache reads.
- Experiments on LongBench v2 (120K), RULER (120K), and LongGenBench (16K) report up to 99% fewer KV accesses, over 60% lower token-generation latency at 128K context, and 14.3×+ attention-phase speedups over FlashInfer, while matching full-attention quality.
- The method is model-agnostic and designed to work with IO-aware kernels, paged-KV managers, and MQA/GQA, with code released on GitHub.
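The three-stage loop described above can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's implementation: the cache layout, the L2 threshold `tau`, the band width, the window size, and all function names here are assumptions. The numerically stable merge combines partial softmax-attention results via their log-sum-exp statistics, in the style of FlashAttention/FlashInfer partial-state merging.

```python
import numpy as np

def partial_attn(q, K, V):
    """Softmax attention over one KV slice; returns (output, log-sum-exp)
    so partial results can later be merged without renormalization error."""
    s = K @ q
    m = s.max()
    w = np.exp(s - m)
    return (w @ V) / w.sum(), m + np.log(w.sum())

def merge(parts):
    """Numerically stable merge of (output, lse) partials: weight each
    partial by exp(lse - max_lse) and renormalize."""
    m = max(lse for _, lse in parts)
    ws = [np.exp(lse - m) for _, lse in parts]
    return sum(w * o for w, (o, _) in zip(ws, parts)) / sum(ws)

def mac_step(q_pre, q, K, V, cache, window=8, band=4, tau=0.5):
    """One decode step of the match-amend-complete idea (illustrative only).

    q_pre -- pre-RoPE query used for L2 matching
    q     -- position-encoded query used for actual attention
    cache -- list of (q_pre, prefix_partial, covered_len) from recent steps
    """
    n = len(K)
    # MATCH: nearest pre-RoPE query within a local window of recent steps.
    for qp, prefix, cov in cache[-window:]:
        if np.linalg.norm(q_pre - qp) < tau:
            # AMEND: recompute a small band at the match boundary with the fresh query.
            amend = partial_attn(q, K[cov - band:cov], V[cov - band:cov])
            parts = [prefix, amend]
            if cov < n:
                # COMPLETE: fresh attention over the KV tail, then stable merge.
                parts.append(partial_attn(q, K[cov:], V[cov:]))
            return merge(parts)
    # Miss: full attention; cache the prefix partial (band excluded so it
    # can be amended on later hits).
    prefix = partial_attn(q, K[:n - band], V[:n - band])
    band_part = partial_attn(q, K[n - band:], V[n - band:])
    cache.append((q_pre, prefix, n))
    return merge([prefix, band_part])
```

On a hit, the cost is dominated by the band and tail slices, independent of how long the reused prefix is, which is where the constant-in-context-length behavior on match hits would come from.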