Cooperative Memory Paging with Keyword Bookmarks for Long-Horizon LLM Conversations

arXiv cs.CL · April 15, 2026


Key Points

  • The paper tackles the long-horizon problem in LLM chats by introducing “cooperative paging,” which replaces evicted context segments with compact keyword bookmarks and enables the model to call a recall() tool to fetch full content when needed.
  • On the LoCoMo benchmark (10 real, multi-session conversations; 300+ turns), cooperative paging delivers the best answer quality among six tested methods, with results validated by four independent LLM judges (p=0.017).
  • An ablation study finds that fixed-size, coarse paging works far better than certain content-aware boundary strategies, and that eviction policy effectiveness depends on the data domain (FIFO for synthetic, LFU for LoCoMo).
  • Two bookmark generation strategies improve end-to-end performance over a heuristic baseline, but the key remaining limitation is bookmark discrimination: recall is triggered often, yet the correct page is selected only about 57% of the time when bookmarks lack distinctiveness.
  • The study concludes that bookmark specificity is crucial, accounting for roughly a 25 percentage-point accuracy gap in selecting the right evicted segment.
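The core mechanism described above can be sketched in a few lines. This is a hypothetical illustration, not the paper's actual implementation: the class and method names (`PagedMemory`, `evict`, `recall`) and the bookmark format are assumptions based on the `[pN:keywords]` notation and `recall()` tool named in the summary.

```python
class PagedMemory:
    """Minimal sketch of cooperative paging: evicted conversation
    segments are swapped for compact keyword bookmarks, and a recall()
    tool restores the full text on demand."""

    def __init__(self):
        self.pages = {}      # page_id -> full evicted text
        self.bookmarks = {}  # page_id -> bookmark string left in context

    def evict(self, page_id: int, text: str, keywords: list[str]) -> str:
        """Replace a context segment with a short bookmark (~8-24 tokens)."""
        self.pages[page_id] = text
        bookmark = f"[p{page_id}:{','.join(keywords)}]"
        self.bookmarks[page_id] = bookmark
        return bookmark  # only this marker remains in the prompt

    def recall(self, page_id: int) -> str:
        """Tool exposed to the model: fetch an evicted page verbatim."""
        return self.pages.get(page_id, "")


mem = PagedMemory()
tag = mem.evict(3, "Alice said her flight lands Tuesday at 9am.",
                ["Alice", "flight", "Tuesday"])
# tag == "[p3:Alice,flight,Tuesday]" stays in context; the full turn
# is recoverable via mem.recall(3) when the model decides it needs it.
```

The bookmark-discrimination bottleneck noted above lives in the step this sketch glosses over: the model must pick the right `page_id` from bookmarks alone, which fails when keywords are generic.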

Abstract

When LLM conversations grow beyond the context window, old content must be evicted -- but how does the model recover it when needed? We propose cooperative paging: evicted segments are replaced with minimal keyword bookmarks ([pN:keywords], ~8-24 tokens each), and the model is given a recall() tool to retrieve full content on demand. On the LoCoMo benchmark (10 real multi-session conversations, 300+ turns), cooperative paging achieves the highest answer quality among six methods -- outperforming truncation, BM25, word-overlap retrieval, a search-tool baseline, and full context -- on four models (GPT-4o-mini, DeepSeek-v3.2, Claude Haiku, GLM-5), as confirmed by four independent LLM judges (p=0.017, paired bootstrap). We then study the paging design space with a 5x4 ablation over boundary strategies and eviction policies (3,176 synthetic probes, 1,600 LoCoMo probes). Key findings: (1) coarse fixed-size pages (fixed_20) reach 96.7% accuracy while content-aware topic_shift collapses to 56.7%; (2) the best eviction policy is data-dependent (FIFO on synthetic, LFU on LoCoMo); (3) two bookmark generation strategies improve over the heuristic baseline (+4.4 and +8.7 E2E points); (4) the remaining bottleneck is bookmark discrimination -- the model triggers recall() 96% of the time but selects the correct page only 57% of the time when bookmarks are insufficiently distinctive. Keyword specificity alone accounts for a 25-percentage-point accuracy difference.