Temporal Fact Conflicts in LLMs: Reproducibility Insights from Unifying DYNAMICQA and MULAN

arXiv cs.CL / 3/18/2026


Key Points

  • The paper reproduces experiments from DYNAMICQA and MULAN, two benchmarks that reach opposite conclusions about whether external context can resolve temporal fact conflicts in LLMs.
  • It standardizes both datasets and uses synthetic natural-language contexts to enable direct cross-benchmark comparisons.
  • The findings show strong dataset dependence: MULAN's conclusions hold under both evaluation frameworks, while applying MULAN's evaluation to the DYNAMICQA dataset yields mixed results.
  • It extends the experiments beyond the 7B models used in the original studies, showing that model size influences how temporal facts are encoded and updated.
  • The work emphasizes how dataset design, evaluation metrics, and model scale shape LLM behavior for resolving temporal knowledge conflicts, informing future benchmarking.

Abstract

Large Language Models (LLMs) often struggle with temporal fact conflicts due to outdated or evolving information in their training data. Two recent studies with accompanying datasets report opposite conclusions on whether external context can effectively resolve such conflicts. DYNAMICQA evaluates how effective external context is in shifting the model's output distribution, finding that temporal facts are more resistant to change. In contrast, MULAN examines how often external context changes memorised facts, concluding that temporal facts are easier to update. In this reproducibility paper, we first reproduce experiments from both benchmarks. We then reproduce the experiments of each study on the dataset of the other to investigate the source of their disagreement. To enable direct comparison of findings, we standardise both datasets to align with the evaluation settings of each study. Importantly, using an LLM, we synthetically generate realistic natural language contexts to replace MULAN's programmatically constructed statements when reproducing the findings of DYNAMICQA. Our analysis reveals strong dataset dependence: MULAN's findings generalise under both methodological frameworks, whereas applying MULAN's evaluation to DYNAMICQA yields mixed outcomes. Finally, while the original studies only considered 7B LLMs, we reproduce these experiments across LLMs of varying sizes, revealing how model size influences the encoding and updating of temporal facts. Our results highlight how dataset design, evaluation metrics, and model size shape LLM behaviour in the presence of temporal knowledge conflicts.