MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens
arXiv cs.CL, 2026-03-26
Key Points
- The paper introduces Memory Sparse Attention (MSA), an end-to-end trainable memory model framework designed to scale LLM-like long-term context far beyond typical full-attention limits (up to 100M tokens).
- MSA addresses the bottlenecks of prior long-context approaches (precision degradation, rising latency, and limited ability to modify memory dynamically) by combining scalable sparse attention with document-wise RoPE to achieve linear complexity.
- Experiments report strong stability, with less than 9% degradation when scaling from 16K to 100M tokens, suggesting practical feasibility for lifetime-scale memory use cases.
- The method adds KV-cache compression plus "Memory Parallel" to run 100M-token inference on 2×A800 GPUs, and "Memory Interleaving" to support multi-hop reasoning across separated memory segments.
- The authors claim MSA outperforms frontier LLMs, state-of-the-art RAG systems, and memory-agent approaches on long-context benchmarks, positioning it as a route to intrinsic, lifetime-scale memory.
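To make the linear-complexity claim concrete, below is a minimal sketch of block-sparse attention over a long key/value "memory". This is an illustrative assumption, not the paper's actual MSA algorithm: the block-summary scoring heuristic, the function name, and all parameters are hypothetical. The key idea it demonstrates is that each query attends to only `top_k` fixed-size blocks, so per-query cost stays constant as the memory grows, instead of scaling with total length as in full attention.

```python
# Illustrative block-sparse attention sketch (NOT the paper's method).
# A query attends only to the top_k blocks whose mean key is most
# similar to it, giving O(top_k * block_size) work per query rather
# than O(n) for a memory of n tokens.
import numpy as np

def block_sparse_attention(q, K, V, block_size=64, top_k=4):
    """Attend a single query vector q to the top_k most relevant
    key/value blocks of the memory (K, V)."""
    n, d = K.shape
    n_blocks = n // block_size
    Kb = K[: n_blocks * block_size].reshape(n_blocks, block_size, d)
    Vb = V[: n_blocks * block_size].reshape(n_blocks, block_size, d)

    # Score each block by the query's similarity to the block's mean key
    # (a simple stand-in for whatever selection rule MSA actually uses).
    summaries = Kb.mean(axis=1)              # (n_blocks, d)
    block_scores = summaries @ q             # (n_blocks,)
    top = np.argsort(block_scores)[-top_k:]  # indices of selected blocks

    # Ordinary softmax attention, restricted to the selected blocks.
    K_sel = Kb[top].reshape(-1, d)           # (top_k * block_size, d)
    V_sel = Vb[top].reshape(-1, d)
    logits = K_sel @ q / np.sqrt(d)
    w = np.exp(logits - logits.max())        # numerically stable softmax
    w /= w.sum()
    return w @ V_sel                         # (d,) attention output

rng = np.random.default_rng(0)
n, d = 4096, 32
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))
q = rng.standard_normal(d)
out = block_sparse_attention(q, K, V)
print(out.shape)
```

Note that doubling `n` here leaves the attended set at `top_k * block_size` keys, which is the structural property that lets sparse-attention memory models scale to very long contexts.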