HorizonBench: Long-Horizon Personalization with Evolving Preferences

arXiv cs.CL / 4/21/2026


Key Points

  • The paper introduces “long-horizon personalization,” where a user’s preferences evolve over months and systems must detect when a stated preference is overridden by subsequent life events.
  • It addresses a major gap in existing resources by creating HorizonBench, a new benchmark with naturalistic 6-month conversational histories plus ground-truth provenance for every preference change.
  • HorizonBench is built from a structured mental state graph–based data generator, providing 4,245 benchmark items from 360 simulated users, averaging ~4,300 turns and ~163K tokens per history.
  • Evaluations on 25 frontier models show poor performance: the best model achieves 52.8% accuracy, while most score at or below the 20% chance baseline, and many errors reflect a failure to update beliefs about evolved preferences.
  • The study finds that models frequently revert to the user’s originally stated value rather than tracking updates, and this belief-update/state-tracking shortcoming persists across context lengths and across levels of preference-expression explicitness.

Abstract

User preferences evolve across months of interaction, and tracking them requires inferring when a stated preference has been changed by a subsequent life event. We define this problem as long-horizon personalization and observe that progress on it is limited by data availability and measurement, with no existing resource providing both naturalistic long-horizon interactions and the ground-truth provenance needed to diagnose why models fail. We introduce a data generator that produces conversations from a structured mental state graph, yielding ground-truth provenance for every preference change across 6-month timelines, and from it construct HorizonBench, a benchmark of 4,245 items from 360 simulated users with 6-month conversation histories averaging ~4,300 turns and ~163K tokens. HorizonBench provides a testbed for long-context modeling, memory-augmented architectures, theory-of-mind reasoning, and user modeling. Across 25 frontier models, the best model reaches 52.8% and most score at or below the 20% chance baseline. When these models err on evolved preferences, over a third of the time they select the user's originally stated value without tracking the updated user state. This belief-update failure persists across context lengths and expression explicitness levels, identifying state-tracking capability as the primary bottleneck for long-horizon personalization.
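The failure mode the abstract describes — answering with the user's originally stated value instead of the one in effect after a later life event — can be made concrete with a small sketch. Everything below is hypothetical illustration, not the paper's actual data format or evaluation code: the class names, fields, and the toy "diet" example are invented to show how ground-truth provenance lets an evaluator distinguish a reversion error from other mistakes.

```python
from dataclasses import dataclass

@dataclass
class PreferenceEvent:
    """One node in a hypothetical mental-state timeline: a preference
    value plus the life event (provenance) that put it in effect."""
    turn: int         # conversation turn where the change surfaces
    value: str        # preference value in effect from this turn on
    provenance: str   # ground-truth cause, e.g. a life event

@dataclass
class PreferenceTimeline:
    topic: str
    events: list      # PreferenceEvent objects, sorted by turn

    def value_at(self, turn):
        """Active preference at a turn: the last event at or before it."""
        active = None
        for ev in self.events:
            if ev.turn <= turn:
                active = ev
        return active

def diagnose(timeline, query_turn, model_answer):
    """Classify an answer as correct, a reversion to the original
    stated value (the belief-update failure), or some other error."""
    current = timeline.value_at(query_turn).value
    original = timeline.events[0].value
    if model_answer == current:
        return "correct"
    if model_answer == original:
        return "reverted-to-original"
    return "other-error"

# Toy example: the user goes vegetarian mid-history after a health event.
diet = PreferenceTimeline(
    topic="diet",
    events=[
        PreferenceEvent(turn=12, value="loves steakhouses",
                        provenance="initial statement"),
        PreferenceEvent(turn=2900, value="vegetarian",
                        provenance="doctor's advice after a checkup"),
    ],
)

print(diagnose(diet, query_turn=4000, model_answer="loves steakhouses"))
# → "reverted-to-original"
```

Because each event carries its provenance, an evaluator can attribute an error to a specific missed update rather than to generic long-context forgetting — which is what makes the over-a-third reversion statistic diagnosable in the first place.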