Evaluating Memory Capability in Continuous Lifelog Scenario

arXiv cs.CL / 4/14/2026

Key Points

The paper introduces LifeDialBench, a new benchmark for evaluating memory systems in continuous lifelog scenarios. It addresses the mismatch between existing benchmarks, which focus on one-on-one chat and human-AI interaction, and the demands of real-world ambient conversation data.
LifeDialBench includes two subsets, EgoMem (built from real-world egocentric videos) and LifeMem (built from a simulated virtual community), designed to cover complementary lifelog memory conditions.
It proposes an Online Evaluation protocol that enforces temporal causality to prevent temporal leakage and tests systems in a streaming, realistic setting.
Experimental results show that advanced memory systems do not beat a simple RAG baseline, suggesting that overly complex architectures and lossy compression can harm lifelog memory performance.
The authors release the code and data to support reproducible evaluation of memory capabilities for lifelog-based applications.
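
The temporal-causality idea behind the Online Evaluation protocol can be illustrated with a minimal sketch. This is not the authors' implementation: the `Utterance`, `Question`, and `VerbatimMemory` types, the `online_evaluate` function, and the substring-match scoring are all hypothetical names and choices made for illustration. The only point it shows is that a question asked at time t is answered from utterances ingested up to t, never from later ones.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Utterance:
    timestamp: float  # seconds since the start of the lifelog (assumed unit)
    speaker: str
    text: str


@dataclass(frozen=True)
class Question:
    timestamp: float  # moment at which the question is asked
    text: str
    answer: str


@dataclass
class VerbatimMemory:
    """Toy memory that stores every utterance unchanged (hypothetical stand-in)."""
    log: list = field(default_factory=list)

    def ingest(self, utt: Utterance) -> None:
        self.log.append(utt)

    def answer(self, question: Question) -> str:
        # Return the most recent stored utterance sharing a word with the question.
        q_words = set(question.text.lower().split())
        for utt in reversed(self.log):
            if q_words & set(utt.text.lower().split()):
                return utt.text
        return ""


def online_evaluate(memory, utterances, questions):
    """Replay the stream in time order; each question sees only earlier utterances."""
    # Assumption: at equal timestamps, utterances (kind 0) come before questions (kind 1).
    events = [(u.timestamp, 0, i, u) for i, u in enumerate(utterances)]
    events += [(q.timestamp, 1, i, q) for i, q in enumerate(questions)]
    events.sort(key=lambda e: e[:3])
    results = []
    for _, kind, _, item in events:
        if kind == 0:
            memory.ingest(item)
        else:
            predicted = memory.answer(item)
            # Crude correctness check: gold answer appears in the prediction.
            correct = item.answer.lower() in predicted.lower()
            results.append((item, predicted, correct))
    return results
```

An offline harness would instead ingest the whole log before asking any question, which is where temporal leakage can creep in: a question about "lunch" asked early in the day could be answered from a rescheduling that had not yet happened.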

Abstract

Nowadays, wearable devices can continuously lifelog ambient conversations, creating substantial opportunities for memory systems. However, existing benchmarks primarily focus on online one-on-one chatting or human-AI interactions, thus neglecting the unique demands of real-world scenarios. Given the scarcity of public lifelogging audio datasets, we propose a hierarchical synthesis framework to curate LifeDialBench, a novel benchmark comprising two complementary subsets: EgoMem, built on real-world egocentric videos, and LifeMem, constructed using a simulated virtual community. Crucially, to address the issue of temporal leakage in traditional offline settings, we propose an Online Evaluation protocol that strictly adheres to temporal causality, ensuring systems are evaluated in a realistic streaming fashion. Our experimental results reveal a counterintuitive finding: current sophisticated memory systems fail to outperform a simple RAG-based baseline. This highlights the detrimental impact of over-designed structures and lossy compression in current approaches, emphasizing the necessity of high-fidelity context preservation for lifelog scenarios. We release our code and data at https://github.com/qys77714/LifeDialBench.
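
To make the notion of a simple RAG-style baseline with high-fidelity context concrete, here is a rough sketch of the retrieval step. It is an assumption-laden illustration, not the baseline evaluated in the paper: the functions `bag_of_words`, `cosine`, `retrieve`, and `build_prompt` are hypothetical, and bag-of-words cosine similarity stands in for whatever embedding model a real system would use. The relevant property is that chunks are stored and returned verbatim, with no summarization or compression.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercased whitespace tokens with counts (stand-in for an embedding)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two token-count vectors; 0.0 if either is empty."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def retrieve(chunks, query, k=2):
    """Return the k verbatim chunks most similar to the query."""
    q = bag_of_words(query)
    ranked = sorted(chunks, key=lambda c: cosine(bag_of_words(c), q), reverse=True)
    return ranked[:k]


def build_prompt(chunks, query, k=2):
    """Assemble retrieved raw text and the question into a prompt for a reader model."""
    context = "\n".join(retrieve(chunks, query, k))
    return f"Context:\n{context}\n\nQuestion: {query}"
```

Because the retrieved text is passed through unchanged, any detail present in the original conversation remains available to the reader model, which is the kind of context preservation the abstract argues for.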