Storage Is Not Memory: A Retrieval-Centered Architecture for Agent Recall

arXiv cs.CL / 5/7/2026


Key Points

  • The paper argues that “agent memory” should not be based on extracting and storing content at ingestion time, because information discarded before the query is known cannot be recovered later.
  • It proposes “True Memory,” a six-layer retrieval-centered architecture that preserves events verbatim and replaces storage-schema assumptions with a multi-stage retrieval pipeline.
  • The full system is designed to run as a single SQLite file on commodity CPUs, avoiding external databases, vector indexes, graph stores, and GPUs.
  • Reported accuracy for True Memory Pro (3-run means): 93.0% on LoCoMo (vs. 61.4% for Mem0, 65.4% for Supermemory, and about 71% for Zep, though below EverMemOS at 94.5%), 87.8% on LongMemEval, and 76.6% on BEAM-1M (above the prior published 73.9% from Hindsight).
  • A 56-configuration ablation shows a spread of 1.3 percentage points within the top-performing family of configurations.
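
To make the "store verbatim, decide at query time" idea concrete, here is a minimal sketch in Python. It is not the paper's six-layer pipeline: the table layout, the function names (`open_store`, `remember`, `recall`), and the word-overlap ranking are all assumptions chosen for illustration. It only shows the two properties the key points describe, namely that events are kept exactly as received and that everything lives in one SQLite database.

```python
import re
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a single-file SQLite store for verbatim events.

    Hypothetical schema for illustration; the paper's actual schema is not
    given in this summary.
    """
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events ("
        " id INTEGER PRIMARY KEY,"
        " session TEXT NOT NULL,"
        " text TEXT NOT NULL)"
    )
    return conn


def remember(conn, session, text):
    """Store the event exactly as received; nothing is summarized or dropped."""
    conn.execute(
        "INSERT INTO events (session, text) VALUES (?, ?)", (session, text)
    )
    conn.commit()


def tokens(s):
    """Lowercase alphanumeric word set, used only for the toy ranking below."""
    return set(re.findall(r"[a-z0-9]+", s.lower()))


def recall(conn, query, k=3):
    """Rank stored events by word overlap with the query, at query time.

    The overlap score is a stand-in (an assumption), not the paper's
    multi-stage retrieval.
    """
    q = tokens(query)
    scored = []
    rows = conn.execute("SELECT id, session, text FROM events")
    for event_id, session, text in rows:
        overlap = len(q & tokens(text))
        if overlap:
            scored.append((overlap, event_id, session, text))
    # Highest overlap first; ties broken by earliest event id.
    scored.sort(key=lambda row: (-row[0], row[1]))
    return [(session, text) for _, _, session, text in scored[:k]]


conn = open_store()
remember(conn, "s1", "I adopted a beagle named Pepper last spring.")
remember(conn, "s1", "My flight to Lisbon leaves on Friday.")
remember(conn, "s2", "Pepper hates the vacuum cleaner.")

hits = recall(conn, "Which beagle did I adopt?")
vacuum_hits = recall(conn, "vacuum")
```

Because the third event was stored verbatim, a later question about the vacuum cleaner can still be answered, even though an ingestion-time extractor that kept only, say, pet names would have discarded that detail before anyone asked about it.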

Abstract

Extraction at ingestion is the wrong primitive for agent memory: content discarded before the query is known cannot be recovered at retrieval time. We propose True Memory, a six-layer architecture that shifts the center of the system from a storage schema to a multi-stage retrieval pipeline operating over events preserved verbatim. The full system runs as a single SQLite file on commodity CPU with no external database, vector index, graph store, or GPU. On LoCoMo (1,540 questions across 10 multi-session conversations), True Memory Pro reaches 93.0% accuracy (3-run mean) against 61.4% for Mem0, 65.4% for Supermemory, approximately 71% for Zep, and 94.5% for EverMemOS under a matched gpt-4.1-mini answer model. On LongMemEval (500 questions), True Memory Pro reaches 87.8% (3-run mean). On BEAM-1M (700 questions at the 1-million-token scale), True Memory Pro reaches 76.6% (3-run mean), above the prior published result of 73.9% for Hindsight. A 56-configuration ablation shows a 1.3-percentage-point spread within the top-performing configuration family.
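
The LoCoMo and BEAM-1M comparisons in the abstract are easier to read as margins in percentage points. The short Python calculation below uses only the figures quoted above; the variable names are illustrative, and the Zep value is entered as 71.0 because the abstract gives it only approximately.

```python
# Accuracy figures (percent) as quoted in the abstract.
true_memory_pro_locomo = 93.0
locomo_baselines = {
    "Mem0": 61.4,
    "Supermemory": 65.4,
    "Zep": 71.0,  # reported as "approximately 71%"
    "EverMemOS": 94.5,
}

# Positive margin: True Memory Pro is ahead; negative: it is behind.
locomo_margins = {
    name: round(true_memory_pro_locomo - score, 1)
    for name, score in locomo_baselines.items()
}

# BEAM-1M: True Memory Pro (76.6) vs. the prior published Hindsight result (73.9).
beam_margin = round(76.6 - 73.9, 1)

for name, margin in locomo_margins.items():
    print(f"LoCoMo vs. {name}: {margin:+.1f} points")
print(f"BEAM-1M vs. Hindsight: {beam_margin:+.1f} points")
```

On these numbers, True Memory Pro leads Mem0, Supermemory, and Zep by roughly 22 to 32 points on LoCoMo, trails EverMemOS by 1.5 points, and leads Hindsight by 2.7 points on BEAM-1M.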