PERMA: Benchmarking Personalized Memory Agents via Event-Driven Preference and Realistic Task Environments

arXiv cs.AI · March 25, 2026


Key Points

  • The paper introduces PERMA, a benchmark for evaluating long-term personalized memory agents by testing how well models maintain persona consistency over temporally ordered, multi-session interactions rather than relying on static preference recall.
  • It addresses limitations of prior evaluations that mix preference-related dialogue with irrelevant conversation, by modeling how user preferences gradually emerge and accumulate across noisy contexts.
  • PERMA incorporates simulated real-world input variability and linguistic alignment (idiolects) using temporally evolving event sequences with preference queries inserted over time.
  • The benchmark includes both multiple-choice and interactive tasks to measure a model’s ability to track preferences along an interaction timeline, across multiple domains.
  • Experiments suggest that event-linked memory systems can recover more precise preferences and reduce token usage compared with semantic retrieval, but still struggle with long-horizon persona coherence and cross-domain interference.
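The "event-linked memory" idea described above can be illustrated with a minimal sketch: instead of keyword-searching raw dialogue, related interactions are chained by topic so the agent can read off the most recent preference statement. This is a hypothetical simplification, not the paper's implementation; the `Event` and `EventLinkedMemory` names and the latest-wins resolution rule are assumptions for illustration only.

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class Event:
    session: int   # temporal order of the interaction
    topic: str     # domain/topic used to link related events
    text: str      # raw interaction content

class EventLinkedMemory:
    """Toy memory that chains interactions by topic, so evolving
    preferences are resolved by the most recent linked event."""
    def __init__(self):
        self.chains = defaultdict(list)

    def add(self, event: Event) -> None:
        # Events arrive in temporal order; append to the topic's chain.
        self.chains[event.topic].append(event)

    def latest_preference(self, topic: str):
        # Preferences evolve over sessions: the newest event in the
        # chain supersedes earlier statements on the same topic.
        chain = self.chains.get(topic)
        return chain[-1].text if chain else None

events = [
    Event(1, "coffee", "User orders a latte."),
    Event(2, "travel", "User books a beach trip."),
    Event(5, "coffee", "User now prefers black coffee, no milk."),
]
mem = EventLinkedMemory()
for e in events:
    mem.add(e)
print(mem.latest_preference("coffee"))
# -> "User now prefers black coffee, no milk."
```

A semantic-retrieval baseline, by contrast, might surface the session-1 latte order as the closest match to a "coffee preference" query; linking events by topic makes the temporal supersession explicit and cheap to resolve.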

Abstract

Empowering large language models with long-term memory is crucial for building agents that adapt to users' evolving needs. However, prior evaluations typically interleave preference-related dialogues with irrelevant conversations, reducing the task to needle-in-a-haystack retrieval while ignoring relationships between events that drive the evolution of user preferences. Such settings overlook a fundamental characteristic of real-world personalization: preferences emerge gradually and accumulate across interactions within noisy contexts. To bridge this gap, we introduce PERMA, a benchmark designed to evaluate persona consistency over time beyond static preference recall. Additionally, we incorporate (1) text variability and (2) linguistic alignment to simulate erratic user inputs and individual idiolects in real-world data. PERMA consists of temporally ordered interaction events spanning multiple sessions and domains, with preference-related queries inserted over time. We design both multiple-choice and interactive tasks to probe the model's understanding of persona along the interaction timeline. Experiments demonstrate that by linking related interactions, advanced memory systems can extract more precise preferences and reduce token consumption, outperforming traditional semantic retrieval of raw dialogues. Nevertheless, they still struggle to maintain a coherent persona across temporal depth and cross-domain interference, highlighting the need for more robust personalized memory management in agents. Our code and data are open-sourced at https://github.com/PolarisLiu1/PERMA.
