AI Navigate

What Really Controls Temporal Reasoning in Large Language Models: Tokenisation or Representation of Time?

arXiv cs.CL / 3/20/2026

Key Points

  • MultiTempBench is introduced as a multilingual temporal reasoning benchmark that covers date arithmetic, time zone conversion, and temporal relation extraction across five languages and multiple calendar conventions.
  • The study evaluates 20 large language models and proposes the multilingual Date Fragmentation Ratio (mDFR) alongside geometric-probing analyses to study temporal representations.
  • It finds that tokenisation quality of temporal artefacts becomes a resource-dependent bottleneck, with fragmentation hurting accuracy in low-resource languages and rarer calendars, while high-resource settings are more robust to digit-level splitting.
  • Temporal linearity emerges as the strongest predictor of temporal reasoning in high-resource languages, whereas fragmentation is the stronger predictor in low-resource languages, so the dominant factor depends on how well-resourced the language is.
  • The authors release their code on GitHub (https://github.com/gagan3012/mtb) to support reproduction and further research on multilingual temporal reasoning.
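
To make the idea of date fragmentation concrete, the following is a minimal sketch of a toy fragmentation score: the fraction of date components (year, month, day) that a tokeniser cuts into more than one piece. It is not the paper's mDFR, which is additionally calibrated with human severity ratings. The function names, the span-based input, and the hand-written token lists are all assumptions made for illustration; no real tokeniser is called.

```python
from typing import Dict, List, Set, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def token_cuts(tokens: List[str]) -> Set[int]:
    """Character offsets at which one token ends and the next begins."""
    cuts: Set[int] = set()
    pos = 0
    for tok in tokens[:-1]:
        pos += len(tok)
        cuts.add(pos)
    return cuts


def toy_fragmentation_ratio(
    text: str, tokens: List[str], components: Dict[str, Span]
) -> float:
    """Fraction of date components with a token boundary strictly inside them.

    Hypothetical illustration only; this is not the paper's mDFR definition.
    """
    if "".join(tokens) != text:
        raise ValueError("tokens must concatenate back to the original text")
    if not components:
        raise ValueError("at least one component span is required")
    cuts = token_cuts(tokens)
    split = sum(
        1
        for start, end in components.values()
        if any(start < c < end for c in cuts)
    )
    return split / len(components)


DATE = "2026-03-20"
SPANS = {"year": (0, 4), "month": (5, 7), "day": (8, 10)}

# Hand-written tokenisations standing in for two hypothetical tokenisers.
clean = toy_fragmentation_ratio(DATE, ["2026", "-", "03", "-", "20"], SPANS)
fragmented = toy_fragmentation_ratio(
    DATE, ["20", "26", "-", "0", "3", "-", "20"], SPANS
)
print(clean, fragmented)
```

The first tokenisation keeps every component intact and scores 0.0. The second splits the year and the month but not the day, so two of three components count as fragmented.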

Abstract

We present MultiTempBench, a multilingual temporal reasoning benchmark spanning three tasks (date arithmetic, time zone conversion, and temporal relation extraction) across five languages (English, German, Chinese, Arabic, and Hausa) and multiple calendar conventions (Gregorian, Hijri, and Chinese Lunar). MultiTempBench contains 15,000 examples built by translating 750 curated English questions and expanding each into controlled date-format variants. We evaluate 20 LLMs and introduce the multilingual Date Fragmentation Ratio (mDFR), calibrated with human severity ratings, together with geometric-probing analyses of internal temporal representations. We find that tokenisation quality of temporal artefacts is a resource-dependent bottleneck: in low-resource languages and rarer calendar formats, fragmentation disrupts Year/Month/Day separation and accuracy collapses, while high-resource settings are often robust to digit-level splitting. Beyond tokenisation, crossed mixed-effects regression shows that temporal linearity is the strongest predictor of temporal reasoning in high-resource languages, whereas fragmentation is the stronger predictor in low-resource languages. Code is available at: https://github.com/gagan3012/mtb
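
As a rough intuition for "temporal linearity", one simple proxy is how well a straight line explains a one-dimensional projection of a model's representations as a function of the year they encode. The sketch below computes the R² of an ordinary least-squares fit in plain Python. It is only an assumed proxy, not the paper's geometric-probing procedure, and the projection values are invented toy numbers rather than real model activations.

```python
from typing import Sequence


def linear_r2(xs: Sequence[float], ys: Sequence[float]) -> float:
    """R^2 of a least-squares line fitted to (xs, ys).

    For a single predictor with an intercept, this equals the squared
    Pearson correlation between xs and ys.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("R^2 is undefined when either sequence is constant")
    return (sxy * sxy) / (sxx * syy)


# Toy data: years and a hypothetical 1-D projection of hidden states.
years = [2000, 2005, 2010, 2015, 2020]
ordered = [0.1, 0.2, 0.3, 0.4, 0.5]    # projection grows steadily with time
scrambled = [0.1, 0.5, 0.2, 0.4, 0.3]  # projection barely tracks time

print(linear_r2(years, ordered))    # close to 1: time is laid out linearly
print(linear_r2(years, scrambled))  # roughly 0.09: weak linear structure
```

Under this proxy, a model whose representations place dates along a line in temporal order would score near 1, while one that scatters them would score near 0.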