MKA: Memory-Keyed Attention for Efficient Long-Context Reasoning
arXiv cs.LG, March 24, 2026
Key points
- The paper identifies that long-context language modeling is increasingly bottlenecked by the memory and compute costs of maintaining and attending to large KV caches during both training and inference.
- It proposes Memory-Keyed Attention (MKA), a hierarchical attention approach that combines local, session, and long-term KV caches and dynamically learns how to route attention among them.
- It also introduces Route-Fused MKA (FastMKA), a broadcast-routed variant that fuses memory sources prior to attention computation to reduce runtime overhead.
- Experiments report that FastMKA matches MLA’s perplexity while improving training throughput by up to 5× and reducing evaluation latency by about 1.8×, suggesting a strong accuracy–efficiency trade-off.
- The authors position MKA as a practical and extensible framework for efficient long-context attention beyond the specific variants tested.
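The routing idea above can be sketched in a few lines. The paper's exact architecture is not given here, so everything below is a hypothetical illustration: the function names (`attend`, `mka_step`), the gate computed as a softmax over a learned projection of the query, and the three cache sizes standing in for local, session, and long-term memory are all assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    # Standard scaled dot-product attention over a single KV cache.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def mka_step(q, caches, route_w):
    # Hypothetical MKA-style step: attend to each KV cache separately,
    # then mix the per-source outputs with query-dependent routing gates.
    gates = softmax(q @ route_w)  # one gate per memory source
    outs = [attend(q[None, :], k, v)[0] for k, v in caches]
    return sum(g * o for g, o in zip(gates, outs))

rng = np.random.default_rng(0)
d = 8
q = rng.standard_normal(d)
# Three caches of growing size, standing in for local / session / long-term.
caches = [(rng.standard_normal((n, d)), rng.standard_normal((n, d)))
          for n in (4, 16, 64)]
route_w = rng.standard_normal((d, 3))  # learned routing projection (assumed form)
out = mka_step(q, caches, route_w)
print(out.shape)  # one output vector per query
```

In this sketch each memory source is attended to independently and the results are mixed afterward; a fused variant in the spirit of FastMKA would instead combine the sources before the attention computation, paying one attention pass rather than one per source.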

