CL-bench Life: Can Language Models Learn from Real-Life Context?

arXiv cs.CL / 5/1/2026

📰 News · Signals & Early Trends · Models & Research

Key Points

  • The paper highlights that as AI assistants move from professional environments to everyday life, they must learn from messy, fragmented, and experience-linked real-world contexts (e.g., group chats, personal archives, behavioral traces).
  • To evaluate this capability, the authors introduce CL-bench Life, a human-curated benchmark with 405 context-task pairs and 5,348 verification rubrics covering common real-life scenarios.
  • Experiments on ten frontier language models show real-life context learning is still highly challenging, with the best model reaching only a 19.3% task-solving rate and the average at 13.8%.
  • The results indicate persistent difficulty in reasoning over complex real-life information sources such as disordered multi-party conversation histories and fragmented behavioral records.
  • CL-bench Life is positioned as a testbed to drive improvements toward more reliable AI assistants for everyday use.
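The benchmark scores a model against per-task verification rubrics and reports an overall task-solving rate. The article does not specify the exact scoring formula, so the sketch below is only a plausible illustration, assuming a task counts as solved when every one of its rubrics passes; the function and variable names are hypothetical, not from the paper.

```python
# Hypothetical sketch of rubric-based scoring. CL-bench Life's actual
# protocol may differ; this assumes a task is solved only if ALL of its
# verification rubrics pass.

def task_solving_rate(results: list[list[bool]]) -> float:
    """results[i] holds per-rubric pass/fail flags for task i."""
    if not results:
        return 0.0
    solved = sum(all(rubrics) for rubrics in results)
    return solved / len(results)

# Toy example: 4 tasks, each checked against a few rubrics.
demo = [
    [True, True, True],    # all rubrics pass -> solved
    [True, False, True],   # one rubric fails -> not solved
    [True, True],          # solved
    [False, False, True],  # not solved
]
print(f"{task_solving_rate(demo):.1%}")  # -> 50.0%
```

Under this all-rubrics-must-pass reading, a model can satisfy most rubrics for a task and still score zero on it, which is one way a benchmark with 5,348 fine-grained rubrics could yield low headline rates like 13.8%.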

Abstract

Today's AI assistants such as OpenClaw are designed to handle context effectively, making context learning an increasingly important capability for models. As these systems move beyond professional settings into everyday life, the nature of the contexts they must handle also shifts. Real-life contexts are often messy, fragmented, and deeply tied to personal and social experience, such as multi-party conversations, personal archives, and behavioral traces. Yet it remains unclear whether current frontier language models can reliably learn from such contexts and solve tasks grounded in them. To this end, we introduce CL-bench Life, a fully human-curated benchmark comprising 405 context-task pairs and 5,348 verification rubrics, covering common real-life scenarios. Solving tasks in CL-bench Life requires models to reason over complex, messy real-life contexts, calling for real-life context learning abilities that go far beyond those evaluated in existing benchmarks. We evaluate ten frontier LMs and find that real-life context learning remains highly challenging: even the best-performing model achieves only a 19.3% task-solving rate, while the average across models is just 13.8%. Models still struggle to reason over contexts such as messy group chat histories and fragmented behavioral records from everyday life. CL-bench Life provides a crucial testbed for advancing real-life context learning, and progress on it can enable more intelligent and reliable AI assistants in everyday life.