MAC-Attention: a Match-Amend-Complete Scheme for Fast and Accurate Attention Computation

arXiv cs.AI / 4/2/2026


Key Points

  • The paper introduces MAC-Attention, a fidelity- and access-preserving scheme to speed up long-context LLM decoding by reusing prior attention computations for semantically similar recent queries.
  • MAC-Attention operates in three stages—match (pre-RoPE L2 matching in a local window), amend (recompute a small band near the match boundary), and complete (numerically stable merge with fresh attention on the KV tail).
  • On “match hits,” its compute and bandwidth complexity is constant with respect to context length, aiming to address the IO-bound nature of long-context KV-cache reads.
  • Experiments on LongBench v2 (120K), RULER (120K), and LongGenBench (16K) report up to 99% fewer KV accesses, over 60% lower token generation latency at 128K, and over 14.3x attention-phase speedups (up to 2.6x end-to-end) versus FlashInfer, while maintaining full-attention quality.
  • The method is model-agnostic and designed to work with IO-aware kernels, paged-KV managers, and MQA/GQA, with code released on GitHub.
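The match stage in the pipeline above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's implementation: the function name, the fixed window size, and the distance threshold `tau` are all assumptions made here for clarity.

```python
import numpy as np

def find_match(q_pre_rope, recent_queries, window=64, tau=0.1):
    """Sketch of the 'match' stage: compare the current pre-RoPE query
    against recent pre-RoPE queries by L2 distance within a short local
    window, and return the index of the closest one on a hit.

    q_pre_rope:     (d,) current query vector, before RoPE is applied
    recent_queries: (n, d) buffer of recent pre-RoPE query vectors
    tau:            assumed distance threshold for declaring a match hit
    """
    cand = recent_queries[-window:]                  # restrict to local window
    dists = np.linalg.norm(cand - q_pre_rope, axis=-1)
    i = int(np.argmin(dists))
    if dists[i] <= tau:
        # Hit: return absolute index so cached attention can be reused
        return len(recent_queries) - len(cand) + i
    return None  # Miss: fall back to full attention for this token
```

On a hit, the caller would reuse the stored attention result for that query (then amend and complete it); on a miss, it computes attention from scratch.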

Abstract

Long-context decoding in LLMs is IO-bound: each token re-reads an ever-growing KV cache. Prior accelerations cut bytes either via compression, which lowers fidelity, or via selection/eviction, which restricts what remains accessible; both can degrade delayed recall and long-form generation. We introduce MAC-Attention, a fidelity- and access-preserving alternative that accelerates decoding by reusing prior attention computations for semantically similar recent queries. The scheme proceeds in three stages: a match stage performs pre-RoPE L2 matching over a short local window; an amend stage rectifies the reused attention by recomputing a small band near the match boundary; and a complete stage fuses the rectified results with fresh attention computed on the KV tail through a numerically stable merge. On a match hit, compute and bandwidth complexity is constant regardless of context length. The method is model-agnostic and composes with IO-aware kernels, paged-KV managers, and MQA/GQA. Across LongBench v2 (120K), RULER (120K), and LongGenBench (16K continuous generation), compared to the latest FlashInfer library, MAC-Attention reduces KV accesses by up to 99%, cuts token generation latency by over 60% at 128K, and achieves over 14.3x attention-phase speedups (up to 2.6x end-to-end), while maintaining full-attention quality. By reusing computation, MAC-Attention delivers long-context inference that is both fast and faithful. Code is available at https://github.com/YJHMITWEB/MAC-Attention.git
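The "numerically stable merge" in the complete stage is, in spirit, the standard log-sum-exp state merge used by IO-aware attention kernels such as FlashAttention and FlashInfer: two softmax-normalized partial results can be combined exactly if each carries its running row-max and softmax denominator. The sketch below shows that generic technique; it is not necessarily MAC-Attention's exact kernel.

```python
import numpy as np

def merge_attention(o1, m1, l1, o2, m2, l2):
    """Merge two partial attention states over disjoint key blocks.

    Each partial carries:
      o_i: softmax-normalized output over its block, shape (d,)
      m_i: max attention score within its block (scalar)
      l_i: softmax denominator sum(exp(s - m_i)) within its block (scalar)

    Rescaling by exp(m_i - m) keeps all exponentials <= 1, which is what
    makes the merge numerically stable for large scores.
    """
    m = np.maximum(m1, m2)            # global running max
    a1 = np.exp(m1 - m)               # rescale factors, both in (0, 1]
    a2 = np.exp(m2 - m)
    l = a1 * l1 + a2 * l2             # merged softmax denominator
    o = (a1 * l1 * o1 + a2 * l2 * o2) / l
    return o, m, l
```

Merging the state for the amended, reused prefix with the state for fresh attention over the KV tail in this way reproduces exact softmax attention over the full sequence, so reuse does not change the output distribution.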