Mixture of Chapters: Scaling Learnt Memory in Transformers
arXiv cs.LG / 3/24/2026
Key Points
- The paper addresses a key limitation of Transformers: they don’t have an explicit mechanism to store and retrieve learned knowledge beyond what’s implicitly encoded in parameters.
- It proposes learnable sparse memory banks (latent tokens) that Transformer layers access through cross-attention, enabling associative knowledge retrieval (see the first sketch after this list).
- To scale memory capacity efficiently, the authors introduce “chapter-based routing”: the memory is partitioned into chapters, and a trained, MoE-inspired router selects a relevant subset of chapters per input (see the second sketch below).
- Experiments show the approach scales to about 262K memory tokens while keeping computation tractable, outperforming standard Transformer baselines under iso-FLOP comparisons in both pre-training and instruction fine-tuning.
- The method also appears to improve knowledge retention during continued training and to reduce forgetting when moving between training phases, e.g., from pre-training to instruction tuning.
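To make the memory-bank idea concrete, here is a minimal PyTorch sketch of a latent-token memory accessed through cross-attention. The names (`MemoryCrossAttention`, `num_mem`) and the residual placement are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class MemoryCrossAttention(nn.Module):
    """Sequence hidden states attend to a bank of learned memory tokens.

    Illustrative sketch only; the paper's exact layer design may differ.
    """
    def __init__(self, d_model: int, num_mem: int, num_heads: int = 8):
        super().__init__()
        # The memory bank is a plain learnable parameter: one row per latent token.
        self.memory = nn.Parameter(torch.randn(num_mem, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model). Queries come from the sequence;
        # keys and values come from the shared memory bank.
        mem = self.memory.unsqueeze(0).expand(x.size(0), -1, -1)
        out, _ = self.attn(query=x, key=mem, value=mem)
        return x + out  # residual connection, as in standard Transformer sublayers

# Usage:
layer = MemoryCrossAttention(d_model=512, num_mem=1024)
h = torch.randn(2, 128, 512)   # (batch, seq_len, d_model)
print(layer(h).shape)          # torch.Size([2, 128, 512])
```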
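Chapter-based routing can be sketched in the same style: partition the bank into chapters and use an MoE-style gate to pick the top-k chapters per input. Again, `ChapterRouter`, the mean-pooled gating signal, and `k` are assumptions for illustration; the paper's router may score per token rather than per sequence.

```python
import torch
import torch.nn as nn

class ChapterRouter(nn.Module):
    """Partition the memory bank into chapters; route each input to the top-k chapters."""
    def __init__(self, d_model: int, num_chapters: int, tokens_per_chapter: int, k: int = 2):
        super().__init__()
        self.memory = nn.Parameter(
            torch.randn(num_chapters, tokens_per_chapter, d_model) * 0.02)
        self.gate = nn.Linear(d_model, num_chapters)  # MoE-style chapter scoring
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model). Score chapters from a pooled summary
        # of the sequence, then gather only the selected chapters' tokens.
        # (Real MoE routers usually also weight outputs by softmaxed gate
        # scores so the selection stays differentiable; omitted for brevity.)
        summary = x.mean(dim=1)                     # (batch, d_model)
        scores = self.gate(summary)                 # (batch, num_chapters)
        topk = scores.topk(self.k, dim=-1).indices  # (batch, k)
        selected = self.memory[topk]                # (batch, k, tokens_per_chapter, d_model)
        # Flatten to (batch, k * tokens_per_chapter, d_model) for cross-attention.
        return selected.flatten(1, 2)

# Usage: 256 chapters x 1024 tokens = 262,144 memory tokens total,
# but each input cross-attends to only k x 1024 of them.
router = ChapterRouter(d_model=512, num_chapters=256, tokens_per_chapter=1024, k=2)
h = torch.randn(2, 128, 512)
print(router(h).shape)  # torch.Size([2, 2048, 512])
```

Under these assumptions, cross-attention cost grows with the k selected chapters rather than the full ~262K-token bank, which is what would keep computation tractable as memory capacity scales.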