Mixture of Chapters: Scaling Learnt Memory in Transformers

arXiv cs.LG / 2026/3/24


Key Points

  • The paper addresses a key limitation of Transformers: they don’t have an explicit mechanism to store and retrieve learned knowledge beyond what’s implicitly encoded in parameters.
  • It proposes learnable sparse memory banks (latent tokens) that transformer layers access through cross-attention, enabling associative knowledge retrieval.
  • To scale memory capacity efficiently, the authors introduce “chapter-based routing,” partitioning the memory into chapters and using a trained router (MoE-inspired) to select relevant subsets per input.
  • Experiments show the approach can scale to about 262K memory tokens while keeping computation tractable, outperforming standard Transformer baselines under iso-FLOP comparisons on pre-training and instruction fine-tuning.
  • The method also appears to improve knowledge retention during continued training and to reduce forgetting when transitioning between training phases (e.g., from pretraining to instruction fine-tuning).
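The routing-plus-retrieval idea above can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: all dimensions, the mean-pooled router input, and the single linear router are assumptions chosen for brevity. The key point it shows is that cross-attention only ever touches the top-k selected chapters, so cost grows with `top_k * tokens_per_chapter` rather than the full memory bank.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper), chosen small for illustration.
d_model = 16
n_chapters = 8           # memory bank partitioned into chapters
tokens_per_chapter = 32  # latent tokens per chapter
top_k = 2                # chapters the router selects per input

# Learnable sparse memory bank: latent tokens, randomly initialised
# (trained end-to-end in the paper; left random here).
memory = rng.standard_normal((n_chapters, tokens_per_chapter, d_model))

# Router: scores chapters from a pooled summary of the input sequence.
W_router = rng.standard_normal((d_model, n_chapters))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def route_and_attend(hidden, memory, W_router, top_k):
    """Select top-k chapters, then cross-attend hidden states to their tokens."""
    pooled = hidden.mean(axis=0)          # (d_model,) crude input summary
    scores = pooled @ W_router            # (n_chapters,) router logits
    chosen = np.argsort(scores)[-top_k:]  # indices of the top-k chapters
    # Gather only the selected chapters' tokens: attention cost is
    # top_k * tokens_per_chapter instead of the full 262K-token bank.
    mem = memory[chosen].reshape(-1, memory.shape[-1])
    attn = softmax(hidden @ mem.T / np.sqrt(hidden.shape[-1]))  # (seq, k*T)
    return attn @ mem, chosen             # retrieved memory values per position

hidden = rng.standard_normal((10, d_model))  # a 10-token input sequence
out, chosen = route_and_attend(hidden, memory, W_router, top_k)
print(out.shape)  # (10, 16): one retrieved memory vector per input position
```

In a full model this retrieval would run inside transformer layers alongside self-attention, and the router would be trained (MoE-style) rather than a fixed random projection.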

Abstract

Transformers lack an explicit architectural mechanism for storing and organizing knowledge acquired during training. We introduce learnable sparse memory banks: a set of latent tokens, randomly initialized and trained end-to-end, that transformer layers query via cross-attention to retrieve stored knowledge. To scale memory capacity without prohibitive attention costs, we propose chapter-based routing inspired by Mixture-of-Experts architectures, partitioning the memory bank into chapters and training a router to select relevant subsets per input. This enables scaling to 262K memory tokens while maintaining tractable computation. We evaluate our approach against standard transformers (in iso-FLOP settings) on pre-training and instruction fine-tuning across relevant benchmarks. Our models surpass iso-FLOP baselines, suggesting scope for a new axis of scaling and demonstrating that explicit associative memory provides complementary capacity to what is captured implicitly in model parameters. Additionally, we observe improved knowledge retention under continued training, with robustness to forgetting when transitioning between training phases (e.g., pretraining to instruction fine-tuning).