AlpsBench: An LLM Personalization Benchmark for Real-Dialogue Memorization and Preference Alignment

arXiv cs.AI / 3/31/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • AlpsBench is proposed as a gold-standard evaluation benchmark for LLM personalization, designed to reflect real-world human–LLM dialogues rather than synthetic dialogue data.
  • The benchmark includes 2,500 long-term interaction sequences from WildChat, along with human-verified structured memories capturing explicit and implicit personalization signals.
  • It defines a full memory-management lifecycle with four tasks—information extraction, updating, retrieval, and utilization—along with evaluation protocols for end-to-end performance.
  • Experiments on frontier LLMs and memory-centric systems show persistent weaknesses in extracting latent user traits, a ceiling in memory updating performance, and retrieval degradation with large distractor pools.
  • The study finds that adding explicit memory mechanisms can improve recall but does not automatically lead to more preference-aligned or emotionally resonant responses.

Abstract

As Large Language Models (LLMs) evolve into lifelong AI assistants, LLM personalization has become a critical frontier. However, progress is currently bottlenecked by the absence of a gold-standard evaluation benchmark. Existing benchmarks either overlook the personalized information management that is critical for personalization or rely heavily on synthetic dialogues, which exhibit an inherent distribution gap relative to real-world dialogue. To bridge this gap, we introduce AlpsBench, An LLM PerSonalization benchmark derived from real-world human–LLM dialogues. AlpsBench comprises 2,500 long-term interaction sequences curated from WildChat, paired with human-verified structured memories that encapsulate both explicit and implicit personalization signals. We define four pivotal tasks - personalized information extraction, updating, retrieval, and utilization - and establish protocols to evaluate the entire lifecycle of memory management. Our benchmarking of frontier LLMs and memory-centric systems reveals that: (i) models struggle to reliably extract latent user traits; (ii) memory updating faces a performance ceiling even in the strongest models; (iii) retrieval accuracy declines sharply in the presence of large distractor pools; and (iv) while explicit memory mechanisms improve recall, they do not inherently guarantee more preference-aligned or emotionally resonant responses. AlpsBench aims to provide a comprehensive framework for evaluating LLM personalization.
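To make the four-stage lifecycle concrete, here is a minimal toy sketch of how extraction, updating, retrieval, and utilization fit together. The `MemoryStore` class, its `key: value` parsing, and its keyword-matching retrieval are hypothetical illustrations, not AlpsBench's actual tasks or protocol.

```python
from dataclasses import dataclass, field

# Illustrative sketch of a memory-management lifecycle:
# extract -> update -> retrieve -> utilize.
# Everything here is a toy invention for exposition, not the benchmark's API.

@dataclass
class MemoryStore:
    memories: dict = field(default_factory=dict)

    def extract(self, turn: str) -> dict:
        """Extraction: pull 'key: value' preference statements from a dialogue turn."""
        found = {}
        for part in turn.split(";"):
            if ":" in part:
                k, v = part.split(":", 1)
                found[k.strip()] = v.strip()
        return found

    def update(self, new_facts: dict) -> None:
        """Updating: newer facts overwrite stale ones under the same key."""
        self.memories.update(new_facts)

    def retrieve(self, query: str) -> list:
        """Retrieval: naive keyword match of stored keys against the query."""
        return [v for k, v in self.memories.items() if k in query]

    def utilize(self, query: str) -> str:
        """Utilization: condition a (stub) response on retrieved memories."""
        hits = self.retrieve(query)
        return f"Considering {hits}: ..." if hits else "No relevant memory."

store = MemoryStore()
store.update(store.extract("favorite language: Python; city: Zurich"))
store.update(store.extract("city: Bern"))  # updating overwrites the stale city
print(store.retrieve("city"))   # ['Bern']
print(store.utilize("city"))
```

The point of the sketch is the pipeline shape: each stage's weakness compounds downstream, which is why the paper evaluates the lifecycle end to end rather than any single stage in isolation.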