Mixture of Chapters: Scaling Learnt Memory in Transformers
arXiv cs.LG / 2026/3/24
Key Points
- The paper addresses a key limitation of Transformers: they don’t have an explicit mechanism to store and retrieve learned knowledge beyond what’s implicitly encoded in parameters.
- It proposes learnable sparse memory banks (latent tokens) that transformer layers access through cross-attention, enabling associative knowledge retrieval.
- To scale memory capacity efficiently, the authors introduce “chapter-based routing,” partitioning the memory into chapters and using a trained router (MoE-inspired) to select relevant subsets per input.
- Experiments show the approach can scale to about 262K memory tokens while keeping computation tractable, outperforming standard Transformer baselines under iso-FLOP comparisons on pre-training and instruction fine-tuning.
- The method also appears to improve knowledge retention during continued training and to reduce forgetting when moving between training phases, such as from pretraining to instruction tuning.
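To make the routing idea concrete, here is a minimal NumPy sketch of the mechanism the bullets describe: a memory bank partitioned into chapters, an MoE-style router that scores chapters per input, and cross-attention restricted to the top-k selected chapters. All sizes, names, and the exact gating/attention details are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): 8 chapters of 32 memory
# tokens each, model dimension 16, top-2 chapter routing.
d_model, n_chapters, tokens_per_chapter, top_k = 16, 8, 32, 2

# Learnable sparse memory bank (latent tokens), partitioned into chapters.
memory = rng.normal(size=(n_chapters, tokens_per_chapter, d_model))
# Router weights (MoE-inspired): one score per chapter.
W_router = rng.normal(size=(d_model, n_chapters))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def chapter_routed_memory_read(h):
    """Cross-attend a hidden state h (d_model,) over only the top-k chapters."""
    scores = h @ W_router                 # (n_chapters,) router logits
    chosen = np.argsort(scores)[-top_k:]  # indices of the top-k chapters
    gates = softmax(scores[chosen])       # renormalised gate weights
    out = np.zeros(d_model)
    for g, c in zip(gates, chosen):
        # Scaled dot-product attention of h against chapter c's tokens.
        att = softmax(memory[c] @ h / np.sqrt(d_model))  # (tokens_per_chapter,)
        out += g * (att @ memory[c])      # gated read from chapter c
    return out

h = rng.normal(size=d_model)
read = chapter_routed_memory_read(h)
print(read.shape)
```

The point of the routing is visible in the cost: attention touches only `top_k * tokens_per_chapter` memory tokens instead of all `n_chapters * tokens_per_chapter`, which is what lets the total bank grow to hundreds of thousands of tokens while keeping per-token compute roughly constant.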

