HorizonBench: Long-Horizon Personalization with Evolving Preferences
arXiv cs.CL / 4/21/2026
Key Points
- The paper introduces “long-horizon personalization,” where a user’s preferences evolve over months and systems must detect when a stated preference is overridden by subsequent life events.
- It addresses a major gap in existing resources by creating HorizonBench, a new benchmark with naturalistic 6-month conversational histories plus ground-truth provenance for every preference change.
- HorizonBench is built from a structured mental state graph–based data generator, providing 4,245 benchmark items from 360 simulated users, averaging ~4,300 turns and ~163K tokens per history.
- Evaluations on 25 frontier models show poor performance: the best model achieves 52.8% while most are at or below a 20% chance baseline, and many errors reflect failure to update beliefs about evolved preferences.
- The study finds that models frequently revert to the user’s originally stated preference rather than tracking subsequent updates; this belief-update and state-tracking failure persists across context lengths and across different levels of explicitness in how preferences are expressed.
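The failure mode described above — reverting to the originally stated value instead of the most recent update — can be illustrated with a minimal sketch. This is not the paper's method; the class, names, and turn-indexed history are illustrative assumptions about what "tracking an evolving preference" entails:

```python
from dataclasses import dataclass, field

@dataclass
class PreferenceTracker:
    """Illustrative sketch: keep every observed statement per topic,
    but resolve queries to the most recent one, so later life events
    override earlier stated preferences."""
    # topic -> list of (turn_index, value); names are hypothetical
    _history: dict = field(default_factory=dict)

    def observe(self, turn: int, topic: str, value: str) -> None:
        # Record a preference statement at a given conversation turn.
        self._history.setdefault(topic, []).append((turn, value))

    def current(self, topic: str) -> str:
        # Correct behavior: return the latest update.
        # The benchmark's failing models effectively return the
        # earliest entry instead.
        return max(self._history[topic], key=lambda tv: tv[0])[1]
```

For example, if a user says "I love steak" at turn 1 and "I've gone vegetarian" at turn 120, `current("diet")` should reflect the turn-120 statement, whereas the reported error pattern corresponds to answering with the turn-1 value.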