Evaluating Temporal Consistency in Multi-Turn Language Models
arXiv cs.CL · April 28, 2026
📰 News · Models & Research
Key Points
- The paper studies how multi-turn language models preserve, update, or transfer implicit time-related assumptions across dialogue turns rather than only answering single questions.
- It introduces ChronoScope, a large diagnostic benchmark with over one million deterministically generated multi-turn question chains grounded in Wikidata to test temporal scope stability.
- Evaluations on state-of-the-art models show frequent failures in temporal scope stability, where models drift toward present-day assumptions even when their underlying factual knowledge is correct.
- These violations worsen as conversation length increases and persist even when models are given oracle context, indicating a gap between single-turn factual accuracy and consistent multi-turn temporal reasoning.
- The authors release the dataset and evaluation suite on GitHub for public use and further research.
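The failure mode described above, where a dialogue pins a past time scope early on and the model later drifts back to present-day facts, can be illustrated with a small sketch. This is not the paper's code: the chain templates, the toy fact history, and all function names below are hypothetical, assuming only the general idea of deterministic chains over Wikidata-style (year, value) fact histories.

```python
# Hedged sketch (not ChronoScope itself): a deterministic multi-turn chain
# that pins a past time scope, plus a scorer that flags present-day drift.
# All names and the toy fact history are hypothetical.
from dataclasses import dataclass

@dataclass
class Turn:
    question: str
    gold_answer: str  # answer valid under the time scope pinned in turn 1

def history_value(history, year):
    """Return the attribute value in force in a given year."""
    current = history[0][1]
    for y, v in history:
        if y <= year:
            current = v
    return current

def build_chain(entity, attribute, history):
    """Turn 1 fixes a past time scope; later turns must keep that scope."""
    year, past_value = history[0]       # earliest known value
    present_value = history[-1][1]      # today's value: the drift attractor
    chain = [
        Turn(f"In {year}, what was the {attribute} of {entity}?", past_value),
        Turn("And was that still the case two years later?",
             "yes" if history_value(history, year + 2) == past_value else "no"),
        Turn(f"So what is the {attribute} we have been discussing?", past_value),
    ]
    return chain, present_value

def score_chain(chain, model_answers, present_value):
    """Count correct turns and flag drift: a wrong answer that matches
    the present-day value instead of the pinned past scope."""
    correct = sum(a == t.gold_answer for t, a in zip(chain, model_answers))
    drift = any(a != t.gold_answer and a == present_value
                for t, a in zip(chain, model_answers))
    return correct, drift

# Toy Wikidata-style fact history: (year, value) pairs for one attribute.
history = [(1990, "Bonn"), (1999, "Berlin")]
chain, present = build_chain("Germany", "capital city", history)
# A model that answers the first two turns correctly but drifts to the
# present-day value on the final turn:
answers = ["Bonn", "yes", "Berlin"]
correct, drift = score_chain(chain, answers, present)
print(correct, drift)  # 2 True
```

Because the chains are templated over fact histories, generation is fully deterministic, which is what makes scaling to millions of chains and attributing errors to scope drift (rather than missing knowledge) tractable.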