Memory Sparse Attention seems to be a novel approach to long context (up to 100M tokens)

Reddit r/LocalLLaMA / 4/7/2026


Key Points

Memory Sparse Attention (MSA) targets the “long context rot” problem by using a GPU-resident, sparse index of the KV cache that points to compressed KV cache stored in system RAM.
The approach requires architectural changes (additional layers) and model training so the model can reliably retrieve KV cache from the hybrid memory setup, meaning it can’t be simply retrofitted to existing models.
The project reports training a 4B-parameter Qwen3-based model and claims support for very long contexts, citing results up to roughly 100M tokens.
Deploying the model requires a custom inference engine and serving flow (clone/compile from the provided GitHub), due to the unique model/inference architecture.
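To make the two-tier idea in the points above concrete, here is a toy sketch in plain Python. It is not MSA's actual algorithm: the class name `TwoTierKVCache`, the mean-of-keys block summary, the dot-product scoring, and the use of `zlib` as the "compression" are all assumptions made for illustration. The only thing taken from the post is the overall shape: a small index that stays resident (standing in for VRAM) and points at compressed KV blocks held elsewhere (standing in for system RAM).

```python
import pickle
import zlib
from dataclasses import dataclass, field


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


@dataclass
class TwoTierKVCache:
    """Toy model: small index 'on the GPU', compressed KV blocks 'in system RAM'."""
    index: list = field(default_factory=list)       # one summary vector per block (stand-in for VRAM)
    host_store: list = field(default_factory=list)  # compressed blocks (stand-in for system RAM)

    def add_block(self, keys, values):
        # Summary = mean of the block's key vectors (an assumption, not MSA's scheme).
        dim = len(keys[0])
        summary = [sum(k[i] for k in keys) / len(keys) for i in range(dim)]
        self.index.append(summary)
        # zlib is only a placeholder for whatever KV compression MSA really uses.
        self.host_store.append(zlib.compress(pickle.dumps((keys, values))))

    def retrieve(self, query, top_k=1):
        # Score every block using only the small index, then fetch the best few.
        scores = [dot(query, s) for s in self.index]
        best = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
        return {i: pickle.loads(zlib.decompress(self.host_store[i])) for i in best}


cache = TwoTierKVCache()
cache.add_block(keys=[[1.0, 0.0], [0.9, 0.1]], values=[[10.0], [11.0]])
cache.add_block(keys=[[0.0, 1.0], [0.1, 0.9]], values=[[20.0], [21.0]])

hits = cache.retrieve(query=[0.0, 1.0], top_k=1)
print(sorted(hits))  # block ids selected for this query
```

Only the blocks chosen through the index are ever decompressed, which is the property that would keep VRAM use small. In the real system the retrieval is learned through the added layers and training, which this sketch does not attempt to model.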

Really interesting approach to solving long-context rot. Basically, a hyper-efficient index of the KV cache is kept in the GPU's VRAM, and it points to a compressed KV cache stored in system RAM. It requires introducing new layers, plus the corresponding training, so that the model retrieves the KV cache properly and actually gets the long-context benefits. That means it isn't something you can just retrofit onto an existing model, but it seems worth the effort given the size of the benefits they report.

They have trained a 4B Qwen3-based model. However, you need to use their custom inference engine to serve it because of its unique architecture (clone and compile their GitHub repo).
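For a rough sense of why the KV cache has to leave VRAM at these context lengths, here is a back-of-the-envelope calculation. The layer count, KV-head count, and head dimension below are assumed values for a Qwen3-4B-like dense model and are not taken from the post or from the MSA release, so check the actual model config before relying on them. The calculation also assumes an uncompressed fp16/bf16 cache with standard attention, which is exactly what MSA is trying to avoid.

```python
# Assumed Qwen3-4B-like dimensions (not stated in the post; check the model config).
NUM_LAYERS = 36
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # fp16/bf16


def kv_cache_bytes(num_tokens: int) -> int:
    """Uncompressed KV cache size for a standard (dense) attention stack."""
    # The leading 2 accounts for storing both keys and values.
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return per_token * num_tokens


for tokens in (128_000, 1_000_000, 100_000_000):
    print(f"{tokens:>11,} tokens -> {kv_cache_bytes(tokens) / 1e9:,.1f} GB")
```

Under those assumptions the cache costs about 147 KB per token, so 100M tokens would come to roughly 15 TB uncompressed. That is far beyond any single GPU, which is consistent with the post's description of a small GPU-side index plus a compressed cache in system RAM.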

https://arxiv.org/pdf/2603.23516

https://github.com/EverMind-AI/MSA

https://huggingface.co/EverMind-AI/MSA-4B

https://evermind.ai/blogs/breaking-the-100m-token-limit-msa-architecture-achieves-efficient-end-to-end-long-term-memory-for-llms

submitted by /u/ratbastid2000