MAC-Attention: a Match-Amend-Complete Scheme for Fast and Accurate Attention Computation

arXiv cs.AI / 4/2/2026


Key Points

  • The paper introduces MAC-Attention, a fidelity- and access-preserving scheme to speed up long-context LLM decoding by reusing prior attention computations for semantically similar recent queries.
  • MAC-Attention operates in three stages—match (pre-RoPE L2 matching in a local window), amend (recompute a small band near the match boundary), and complete (numerically stable merge with fresh attention on the KV tail).
  • On “match hits,” its compute and bandwidth complexity is constant with respect to context length, aiming to address the IO-bound nature of long-context KV-cache reads.
  • Experiments on LongBench v2 (120K), RULER (120K), and LongGenBench (16K) report up to 99% fewer KV accesses, over 60% lower token generation latency at 128K, and over 14.3x attention-phase speedups (up to 2.6x end-to-end) versus FlashInfer, while maintaining full-attention quality.
  • The method is model-agnostic and designed to work with IO-aware kernels, paged-KV managers, and MQA/GQA, with code released on GitHub.
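The match stage in the pipeline above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's implementation: the function name, the fixed window size, and the distance threshold `tau` are all assumptions made here for clarity.

```python
import numpy as np

def find_match(q_pre_rope, recent_queries, window=64, tau=0.1):
    """Sketch of the 'match' stage: compare the current pre-RoPE query
    against recent pre-RoPE queries by L2 distance within a short local
    window, and return the index of the closest one on a hit.

    q_pre_rope:     (d,) current query vector, before RoPE is applied
    recent_queries: (n, d) buffer of recent pre-RoPE query vectors
    tau:            assumed distance threshold for declaring a match hit
    """
    cand = recent_queries[-window:]                  # restrict to local window
    dists = np.linalg.norm(cand - q_pre_rope, axis=-1)
    i = int(np.argmin(dists))
    if dists[i] <= tau:
        # Hit: return absolute index so cached attention can be reused
        return len(recent_queries) - len(cand) + i
    return None  # Miss: fall back to full attention for this token
```

On a hit, the caller would reuse the stored attention result for that query (then amend and complete it); on a miss, it computes attention from scratch.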

Abstract

Long-context decoding in LLMs is IO-bound: each token re-reads an ever-growing KV cache. Prior accelerations cut bytes either via compression, which lowers fidelity, or via selection/eviction, which restricts what remains accessible; both can degrade delayed recall and long-form generation. We introduce MAC-Attention, a fidelity- and access-preserving alternative that accelerates decoding by reusing prior attention computations for semantically similar recent queries. The scheme proceeds in three stages: a match stage performs pre-RoPE L2 matching over a short local window; an amend stage rectifies the reused attention by recomputing a small band near the match boundary; and a complete stage fuses the rectified results with fresh attention computed on the KV tail through a numerically stable merge. On a match hit, compute and bandwidth complexity is constant regardless of context length. The method is model-agnostic and composes with IO-aware kernels, paged-KV managers, and MQA/GQA. Across LongBench v2 (120K), RULER (120K), and LongGenBench (16K continuous generation), compared to the latest FlashInfer library, MAC-Attention reduces KV accesses by up to 99%, cuts token generation latency by over 60% at 128K, and achieves over 14.3x attention-phase speedups (up to 2.6x end-to-end), while maintaining full-attention quality. By reusing computation, MAC-Attention delivers long-context inference that is both fast and faithful. Code is available at https://github.com/YJHMITWEB/MAC-Attention.git
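The "numerically stable merge" in the complete stage is, in spirit, the standard log-sum-exp state merge used by IO-aware attention kernels such as FlashAttention and FlashInfer: two softmax-normalized partial results can be combined exactly if each carries its running row-max and softmax denominator. The sketch below shows that generic technique; it is not necessarily MAC-Attention's exact kernel.

```python
import numpy as np

def merge_attention(o1, m1, l1, o2, m2, l2):
    """Merge two partial attention states over disjoint key blocks.

    Each partial carries:
      o_i: softmax-normalized output over its block, shape (d,)
      m_i: max attention score within its block (scalar)
      l_i: softmax denominator sum(exp(s - m_i)) within its block (scalar)

    Rescaling by exp(m_i - m) keeps all exponentials <= 1, which is what
    makes the merge numerically stable for large scores.
    """
    m = np.maximum(m1, m2)            # global running max
    a1 = np.exp(m1 - m)               # rescale factors, both in (0, 1]
    a2 = np.exp(m2 - m)
    l = a1 * l1 + a2 * l2             # merged softmax denominator
    o = (a1 * l1 * o1 + a2 * l2 * o2) / l
    return o, m, l
```

Merging the state for the amended, reused prefix with the state for fresh attention over the KV tail in this way reproduces exact softmax attention over the full sequence, so reuse does not change the output distribution.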