Storage Is Not Memory: A Retrieval-Centered Architecture for Agent Recall

arXiv cs.CL / 5/7/2026

Key Points

The paper argues that “agent memory” should not be based on extracting and storing content at ingestion time, because information discarded before the query is known cannot be recovered later.
It proposes “True Memory,” a six-layer retrieval-centered architecture that preserves events verbatim and replaces storage-schema assumptions with a multi-stage retrieval pipeline.
The full system is designed to run as a single SQLite file on commodity CPUs, avoiding external databases, vector indexes, graph stores, and GPUs.
True Memory Pro reports 93.0% accuracy on LoCoMo (vs. 61.4% for Mem0, 65.4% for Supermemory, ~71% for Zep, and a higher 94.5% for EverMemOS, all under a matched gpt-4.1-mini answer model), 87.8% on LongMemEval, and 76.6% on BEAM-1M (above the prior published 73.9% result from Hindsight). All figures are 3-run means.
An ablation study across 56 configurations shows a spread of only 1.3 percentage points within the top-performing configuration family.

Abstract

Extraction at ingestion is the wrong primitive for agent memory: content discarded before the query is known cannot be recovered at retrieval time. We propose True Memory, a six-layer architecture that shifts the center of the system from a storage schema to a multi-stage retrieval pipeline operating over events preserved verbatim. The full system runs as a single SQLite file on commodity CPU with no external database, vector index, graph store, or GPU. On LoCoMo (1,540 questions across 10 multi-session conversations), True Memory Pro reaches 93.0% accuracy (3-run mean) against 61.4% for Mem0, 65.4% for Supermemory, approximately 71% for Zep, and 94.5% for EverMemOS under a matched gpt-4.1-mini answer model. On LongMemEval (500 questions), True Memory Pro reaches 87.8% (3-run mean). On BEAM-1M (700 questions at the 1-million-token scale), True Memory Pro reaches 76.6% (3-run mean), above the prior published result of 73.9% for Hindsight. A 56-configuration ablation shows a 1.3-percentage-point spread within the top-performing configuration family.
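
To make the central idea concrete, the following is a minimal sketch of "store verbatim, decide at query time" on a single SQLite file. It is not the paper's implementation: the abstract does not describe the six layers, so the schema, the function names (`open_store`, `ingest`, `retrieve`), and the two-stage pipeline (a broad substring match followed by a token-overlap rerank) are all assumptions chosen for illustration, using only Python's standard `sqlite3` and `re` modules.

```python
import re
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a single-file event store. Schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events ("
        "id INTEGER PRIMARY KEY, session TEXT, ts TEXT, text TEXT NOT NULL)"
    )
    return conn


def ingest(conn, session, ts, text):
    """Store the event verbatim: no extraction or summarization at write time."""
    conn.execute(
        "INSERT INTO events (session, ts, text) VALUES (?, ?, ?)",
        (session, ts, text),
    )
    conn.commit()


def tokenize(s):
    """Lowercase alphanumeric tokens, returned as a set."""
    return set(re.findall(r"[a-z0-9]+", s.lower()))


def retrieve(conn, query, k=3, pool=50):
    """Two-stage retrieval (an illustrative stand-in for a multi-stage pipeline)."""
    terms = tokenize(query)
    if not terms:
        return []
    # Stage 1: broad candidate generation. Any event containing any query
    # term as a substring becomes a candidate.
    where = " OR ".join("lower(text) LIKE ?" for _ in terms)
    params = [f"%{t}%" for t in terms]
    rows = conn.execute(
        f"SELECT id, text FROM events WHERE {where} LIMIT ?",
        params + [pool],
    ).fetchall()
    # Stage 2: rerank candidates by exact token overlap with the query,
    # breaking ties by insertion order.
    ranked = sorted(rows, key=lambda r: (-len(terms & tokenize(r[1])), r[0]))
    return [text for _, text in ranked[:k]]


if __name__ == "__main__":
    conn = open_store()
    ingest(conn, "s1", "2026-01-01", "Alice said she is allergic to peanuts.")
    ingest(conn, "s1", "2026-01-02", "Bob booked a flight to Lisbon.")
    for hit in retrieve(conn, "What is Alice allergic to?", k=2):
        print(hit)
```

Because every event is kept as written, a question that was not anticipated at ingestion time can still reach the original wording. A real system would presumably replace both stages with stronger retrieval and ranking components.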