AlpsBench: An LLM Personalization Benchmark for Real-Dialogue Memorization and Preference Alignment

arXiv cs.AI / 3/31/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • AlpsBench is proposed as a gold-standard evaluation benchmark for LLM personalization, designed to reflect real-world human–LLM dialogues rather than synthetic dialogue data.
  • The benchmark includes 2,500 long-term interaction sequences from WildChat, along with human-verified structured memories capturing explicit and implicit personalization signals.
  • It defines a full memory-management lifecycle with four tasks—information extraction, updating, retrieval, and utilization—along with evaluation protocols for end-to-end performance.
  • Experiments on frontier LLMs and memory-centric systems show persistent weaknesses in extracting latent user traits, a ceiling in memory updating performance, and retrieval degradation with large distractor pools.
  • The study finds that adding explicit memory mechanisms can improve recall but does not automatically lead to more preference-aligned or emotionally resonant responses.

Abstract

As Large Language Models (LLMs) evolve into lifelong AI assistants, LLM personalization has become a critical frontier. However, progress is currently bottlenecked by the absence of a gold-standard evaluation benchmark. Existing benchmarks either overlook the personalized information management that is critical for personalization or rely heavily on synthetic dialogues, which exhibit an inherent distribution gap relative to real-world dialogue. To bridge this gap, we introduce AlpsBench, An LLM PerSonalization benchmark derived from real-world human–LLM dialogues. AlpsBench comprises 2,500 long-term interaction sequences curated from WildChat, paired with human-verified structured memories that encapsulate both explicit and implicit personalization signals. We define four pivotal tasks - personalized information extraction, updating, retrieval, and utilization - and establish protocols to evaluate the entire lifecycle of memory management. Our benchmarking of frontier LLMs and memory-centric systems reveals that: (i) models struggle to reliably extract latent user traits; (ii) memory updating faces a performance ceiling even in the strongest models; (iii) retrieval accuracy declines sharply in the presence of large distractor pools; and (iv) while explicit memory mechanisms improve recall, they do not inherently guarantee more preference-aligned or emotionally resonant responses. AlpsBench aims to provide a comprehensive framework for evaluating LLM personalization.
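To make the four-stage lifecycle concrete, here is a minimal toy sketch of how extraction, updating, retrieval, and utilization fit together. The `MemoryStore` class, its `key: value` parsing, and its keyword-matching retrieval are hypothetical illustrations, not AlpsBench's actual tasks or protocol.

```python
from dataclasses import dataclass, field

# Illustrative sketch of a memory-management lifecycle:
# extract -> update -> retrieve -> utilize.
# Everything here is a toy invention for exposition, not the benchmark's API.

@dataclass
class MemoryStore:
    memories: dict = field(default_factory=dict)

    def extract(self, turn: str) -> dict:
        """Extraction: pull 'key: value' preference statements from a dialogue turn."""
        found = {}
        for part in turn.split(";"):
            if ":" in part:
                k, v = part.split(":", 1)
                found[k.strip()] = v.strip()
        return found

    def update(self, new_facts: dict) -> None:
        """Updating: newer facts overwrite stale ones under the same key."""
        self.memories.update(new_facts)

    def retrieve(self, query: str) -> list:
        """Retrieval: naive keyword match of stored keys against the query."""
        return [v for k, v in self.memories.items() if k in query]

    def utilize(self, query: str) -> str:
        """Utilization: condition a (stub) response on retrieved memories."""
        hits = self.retrieve(query)
        return f"Considering {hits}: ..." if hits else "No relevant memory."

store = MemoryStore()
store.update(store.extract("favorite language: Python; city: Zurich"))
store.update(store.extract("city: Bern"))  # updating overwrites the stale city
print(store.retrieve("city"))   # ['Bern']
print(store.utilize("city"))
```

The point of the sketch is the pipeline shape: each stage's weakness compounds downstream, which is why the paper evaluates the lifecycle end to end rather than any single stage in isolation.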