MEMAUDIT: An Exact Package-Oracle Evaluation Protocol for Budgeted Long-Term LLM Memory Writing

arXiv cs.AI / 5/5/2026


Key Points

  • The MEMAUDIT protocol proposes a new way to evaluate long-term LLM memory writing by treating the write-time memory selection as a finite, auditable optimization problem under an explicit storage budget.
  • It decouples memory-writing evaluation from end-to-end question answering so that representation quality, validity-state preservation, and budget-aware selection effects can be measured separately.
  • A MEMAUDIT “package” fully specifies the experience stream, candidate memory representations, storage costs, semantic evidence units, future-query requirements, and the budget, enabling an exact evaluation with a certified denominator.
  • The authors instantiate MEMAUDIT with a concave-over-modular semantic coverage objective under storage and one-representation-per-experience constraints, then compute exact package optima using branch-and-bound with MILP certification.
  • The paper releases reusable package generators, certified solvers, natural package exports, external-system scorers, and cached reproducibility metadata, and applies the protocol to exported Mem0, A-Mem, and Letta stores.
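
To make the "finite, auditable optimization problem" concrete, here is a minimal sketch of a toy package and its exact optimum. Everything in it is an assumption for illustration: the package layout (CANDIDATES, QUERIES, BUDGET), the square root as the concave outer function, and the helper names are hypothetical and are not taken from the paper or its released artifact. It also swaps in plain exhaustive enumeration for the paper's branch-and-bound with MILP certification, which is only workable at toy scale.

```python
from itertools import product
from math import sqrt

# Hypothetical toy package; all names and numbers are illustrative assumptions.
# experience -> candidate representations as (storage_cost, evidence_units)
CANDIDATES = {
    "e1": [(1, {"u1"}), (2, {"u1", "u2"})],
    "e2": [(2, {"u3"}), (5, {"u3", "u4"})],
    "e3": [(2, {"u5"})],
}
# Future-query requirements: the evidence units each query needs.
QUERIES = {"q1": {"u1", "u2"}, "q2": {"u3", "u4", "u5"}}
BUDGET = 6


def score(selection):
    """Concave-over-modular coverage; selection maps experience -> representation index."""
    total = 0.0
    for need in QUERIES.values():
        # Modular inner term: evidence counts add up across the chosen representations.
        mass = sum(len(need & CANDIDATES[e][r][1]) for e, r in selection.items())
        total += sqrt(mass)  # concave outer function (an assumption of this sketch)
    return total


def cost(selection):
    """Total storage cost of the chosen representations."""
    return sum(CANDIDATES[e][r][0] for e, r in selection.items())


def exact_optimum():
    """Enumerate every selection with at most one representation per experience."""
    names = list(CANDIDATES)
    # None means "store nothing for this experience".
    options = [[None, *range(len(CANDIDATES[n]))] for n in names]
    best_score, best_sel = 0.0, {}
    for combo in product(*options):
        sel = {n: r for n, r in zip(names, combo) if r is not None}
        if cost(sel) > BUDGET:
            continue  # violates the storage budget
        s = score(sel)
        if s > best_score:
            best_score, best_sel = s, sel
    return best_score, best_sel


def normalized(selection):
    """A writer's coverage divided by the exact package optimum."""
    return score(selection) / exact_optimum()[0]
```

In this sketch, a memory writer that stores the cheapest representation of every experience would be scored with normalized({"e1": 0, "e2": 0, "e3": 0}); the ratio is meaningful only because the denominator is the exact optimum for the same package and budget.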

Abstract

Long-term LLM agents must compress streams of past interactions into persistent memory before future queries are known. Existing evaluations usually measure final question-answering accuracy, which entangles memory writing with retrieval, prompting, and reader reasoning. We introduce MEMAUDIT, an exact package-oracle evaluation protocol for budgeted long-term memory writing. A MEMAUDIT package fixes an experience stream, candidate memory representations, storage costs, semantic evidence units, future-query requirements, and a budget, turning write-time memory selection into a finite auditable optimization problem with a certified denominator. We instantiate this protocol with a concave-over-modular semantic coverage objective under storage and one-representation-per-experience constraints, and compute exact package optima using branch-and-bound with MILP certification. Across controlled exact packages, validity-heavy stress tests, human-audited natural support slices, and exported Mem0, A-Mem, and Letta stores, MEMAUDIT separates representation quality, validity-state preservation, and budget-aware selection effects that end-to-end QA cannot localize. The resulting artifact provides reusable package generators, certified solvers, natural package exports, external-system scorers, and cached reproducibility metadata for evaluating what memory writers actually preserve under fixed storage budgets.
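
For readers who want the optimization problem in symbols, one generic way to write a concave-over-modular objective under these two constraints is sketched here. The notation is an assumption of this summary, not the paper's: x_{i,r} indicates that representation r of experience i is stored, c_{i,r} is its storage cost, a_{q,i,r} >= 0 is the evidence it contributes toward future query q, w_q >= 0 is a query weight, phi is a nondecreasing concave function, and B is the budget. Reading the one-representation constraint as "at most one" is also an assumption.

```latex
\begin{aligned}
F(x) &= \sum_{q} w_q \,\phi\!\Big( \sum_{i} \sum_{r \in R_i} a_{q,i,r}\, x_{i,r} \Big), \\
x^{\star} &\in \arg\max_{x} F(x) \\
\text{s.t.}\quad & \sum_{i} \sum_{r \in R_i} c_{i,r}\, x_{i,r} \le B, \\
& \sum_{r \in R_i} x_{i,r} \le 1 \quad \text{for every experience } i, \\
& x_{i,r} \in \{0, 1\}.
\end{aligned}
```

Under this reading, a memory writer whose stored selection is x^{w} would be scored as F(x^{w}) / F(x^{\star}), with F(x^{\star}) playing the role of the certified denominator.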