Evaluating Memory Capability in Continuous Lifelog Scenario

arXiv cs.CL / 4/14/2026

Key Points

  • The paper introduces LifeDialBench, a new benchmark for evaluating memory systems in continuous lifelog scenarios. It addresses a mismatch: existing benchmarks target one-on-one chat and human-AI interaction, not the ambient conversations that wearable devices record in the real world.
  • LifeDialBench includes two subsets, EgoMem (built from real-world egocentric videos) and LifeMem (built from a simulated virtual community), designed to cover complementary lifelog memory conditions.
  • It proposes an Online Evaluation protocol that enforces temporal causality to prevent temporal leakage and tests systems in a streaming, realistic setting.
  • Experimental results show that advanced memory systems do not beat a simple RAG baseline, suggesting that overly complex architectures and lossy compression can harm lifelog memory performance.
  • The authors release the code and data to support reproducible evaluation of memory capabilities for lifelog-based applications.
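
The Online Evaluation idea in the third point can be illustrated with a small sketch. The summary does not describe the benchmark's actual data format or API, so everything here is a hypothetical assumption chosen for illustration: the `Event` and `Question` records, the `Memory` interface with `write` and `answer`, and exact-match scoring. The only property it is meant to show is temporal causality: a question asked at time t is answered before any utterance with a later timestamp has been written to memory.

```python
from dataclasses import dataclass
from typing import List, Protocol


@dataclass
class Event:
    time: float  # timestamp of the utterance (hypothetical field)
    text: str


@dataclass
class Question:
    time: float  # moment at which the question is asked
    text: str
    answer: str


class Memory(Protocol):
    """Assumed interface; the real systems under test may look different."""
    def write(self, event: Event) -> None: ...
    def answer(self, question: str) -> str: ...


def online_evaluate(memory: Memory, events: List[Event],
                    questions: List[Question]) -> float:
    """Replay events in time order; answer each question from earlier events only."""
    events = sorted(events, key=lambda e: e.time)
    questions = sorted(questions, key=lambda q: q.time)
    i, correct = 0, 0
    for q in questions:
        # Ingest only utterances that occurred at or before the question time.
        while i < len(events) and events[i].time <= q.time:
            memory.write(events[i])
            i += 1
        # Exact-match scoring is a simplification for this sketch.
        if memory.answer(q.text).strip().lower() == q.answer.strip().lower():
            correct += 1
    return correct / len(questions) if questions else 0.0


class LastMentionMemory:
    """Toy memory that returns the most recently stored utterance verbatim."""
    def __init__(self) -> None:
        self.log: List[str] = []

    def write(self, event: Event) -> None:
        self.log.append(event.text)

    def answer(self, question: str) -> str:
        return self.log[-1] if self.log else ""
```

With two events ("keys on desk" at t=1, "keys in car" at t=5), a question at t=2 must be answered from the first event alone. An offline setup that loads both events before answering would let the later utterance leak into the earlier question.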

Abstract

Nowadays, wearable devices can continuously lifelog ambient conversations, creating substantial opportunities for memory systems. However, existing benchmarks primarily focus on online one-on-one chatting or human-AI interactions, thus neglecting the unique demands of real-world scenarios. Given the scarcity of public lifelogging audio datasets, we propose a hierarchical synthesis framework to curate LifeDialBench, a novel benchmark comprising two complementary subsets: EgoMem, built on real-world egocentric videos, and LifeMem, constructed using a simulated virtual community. Crucially, to address the issue of temporal leakage in traditional offline settings, we propose an Online Evaluation protocol that strictly adheres to temporal causality, ensuring systems are evaluated in a realistic streaming fashion. Our experimental results reveal a counterintuitive finding: current sophisticated memory systems fail to outperform a simple RAG-based baseline. This highlights the detrimental impact of over-designed structures and lossy compression in current approaches, emphasizing the necessity of high-fidelity context preservation for lifelog scenarios. We release our code and data at https://github.com/qys77714/LifeDialBench.
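
The abstract's contrast between lossy compression and high-fidelity context preservation can be made concrete with a toy retrieval store. This is not the paper's RAG baseline, whose retriever and generator are not described here; the class name `VerbatimStore`, the word-overlap scoring, and the `top_k` parameter are all assumptions made for illustration. The sketch keeps every utterance unchanged and returns the stored text that shares the most words with a query, which corresponds to the retrieval half of a RAG pipeline.

```python
import re
from collections import Counter
from typing import List, Tuple


def tokenize(text: str) -> Counter:
    """Lowercase bag of words."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


class VerbatimStore:
    """Hypothetical store that keeps utterances word for word (no summarization)."""

    def __init__(self) -> None:
        self.utterances: List[str] = []

    def add(self, utterance: str) -> None:
        self.utterances.append(utterance)

    def retrieve(self, query: str, top_k: int = 2) -> List[str]:
        q = tokenize(query)
        scored: List[Tuple[int, int, str]] = []
        for idx, u in enumerate(self.utterances):
            # Overlap = number of shared word occurrences (multiset intersection).
            overlap = sum((q & tokenize(u)).values())
            if overlap > 0:
                scored.append((overlap, idx, u))
        # Highest overlap first; ties go to the more recent utterance.
        scored.sort(key=lambda s: (s[0], s[1]), reverse=True)
        return [u for _, _, u in scored[:top_k]]
```

A full pipeline would pass the retrieved utterances to a language model to produce the answer; that step is omitted here because the source does not specify it.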