Mixture of Chapters: Scaling Learnt Memory in Transformers
arXiv cs.LG / 3/24/2026
Key Points
- The paper addresses a key limitation of Transformers: they don’t have an explicit mechanism to store and retrieve learned knowledge beyond what’s implicitly encoded in parameters.
- It proposes learnable sparse memory banks (latent tokens) that Transformer layers access through cross-attention, enabling associative knowledge retrieval (see the first sketch after this list).
- To scale memory capacity efficiently, the authors introduce “chapter-based routing”: the memory is partitioned into chapters, and a trained, MoE-inspired router selects a relevant subset of chapters per input (see the second sketch below).
- Experiments show the approach scales to about 262K memory tokens while keeping computation tractable, outperforming standard Transformer baselines under iso-FLOP comparisons in both pre-training and instruction fine-tuning.
- The method also appears to improve knowledge retention during continued training and to reduce forgetting when moving between training phases, e.g., from pre-training to instruction tuning.
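To make the memory-bank idea concrete, here is a minimal PyTorch sketch of a latent-token memory accessed through cross-attention. The names (`MemoryCrossAttention`, `num_mem`) and the residual placement are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class MemoryCrossAttention(nn.Module):
    """Sequence hidden states attend to a bank of learned memory tokens.

    Illustrative sketch only; the paper's exact layer design may differ.
    """
    def __init__(self, d_model: int, num_mem: int, num_heads: int = 8):
        super().__init__()
        # The memory bank is a plain learnable parameter: one row per latent token.
        self.memory = nn.Parameter(torch.randn(num_mem, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model). Queries come from the sequence;
        # keys and values come from the shared memory bank.
        mem = self.memory.unsqueeze(0).expand(x.size(0), -1, -1)
        out, _ = self.attn(query=x, key=mem, value=mem)
        return x + out  # residual connection, as in standard Transformer sublayers

# Usage:
layer = MemoryCrossAttention(d_model=512, num_mem=1024)
h = torch.randn(2, 128, 512)   # (batch, seq_len, d_model)
print(layer(h).shape)          # torch.Size([2, 128, 512])
```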
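Chapter-based routing can be sketched in the same style: partition the bank into chapters and use an MoE-style gate to pick the top-k chapters per input. Again, `ChapterRouter`, the mean-pooled gating signal, and `k` are assumptions for illustration; the paper's router may score per token rather than per sequence.

```python
import torch
import torch.nn as nn

class ChapterRouter(nn.Module):
    """Partition the memory bank into chapters; route each input to the top-k chapters."""
    def __init__(self, d_model: int, num_chapters: int, tokens_per_chapter: int, k: int = 2):
        super().__init__()
        self.memory = nn.Parameter(
            torch.randn(num_chapters, tokens_per_chapter, d_model) * 0.02)
        self.gate = nn.Linear(d_model, num_chapters)  # MoE-style chapter scoring
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model). Score chapters from a pooled summary
        # of the sequence, then gather only the selected chapters' tokens.
        # (Real MoE routers usually also weight outputs by softmaxed gate
        # scores so the selection stays differentiable; omitted for brevity.)
        summary = x.mean(dim=1)                     # (batch, d_model)
        scores = self.gate(summary)                 # (batch, num_chapters)
        topk = scores.topk(self.k, dim=-1).indices  # (batch, k)
        selected = self.memory[topk]                # (batch, k, tokens_per_chapter, d_model)
        # Flatten to (batch, k * tokens_per_chapter, d_model) for cross-attention.
        return selected.flatten(1, 2)

# Usage: 256 chapters x 1024 tokens = 262,144 memory tokens total,
# but each input cross-attends to only k x 1024 of them.
router = ChapterRouter(d_model=512, num_chapters=256, tokens_per_chapter=1024, k=2)
h = torch.randn(2, 128, 512)
print(router(h).shape)  # torch.Size([2, 2048, 512])
```

Under these assumptions, cross-attention cost grows with the k selected chapters rather than the full ~262K-token bank, which is what would keep computation tractable as memory capacity scales.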