TFRBench: A Reasoning Benchmark for Evaluating Forecasting Systems

arXiv cs.AI / 4/8/2026


Key Points

  • TFRBench is introduced as a new benchmark that evaluates the reasoning capabilities of time-series forecasting systems rather than relying only on numerical accuracy.
  • The benchmark includes a protocol and a systematic multi-agent framework with an iterative verification loop to generate numerically grounded reasoning traces about cross-channel dependencies, trends, and external events.
  • Across ten datasets in five domains, the authors report that these generated reasoning traces are causally effective and useful for evaluation, and that prompting LLMs with them can improve forecasting accuracy (e.g., average accuracy rising from ~40.2% to ~56.6%).
  • Experiments also show that off-the-shelf LLMs often struggle with both forecasting accuracy and effective reasoning in this setting, frequently missing domain-specific dynamics.
  • TFRBench is positioned as a new standard for interpretable, reasoning-based evaluation in time-series forecasting, with the benchmark made publicly available online.
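The multi-agent generation pipeline above can be pictured as a generate-then-verify loop: a reasoner agent drafts a trace, a verifier checks it against the actual numbers, and feedback drives another round until the trace is grounded. The sketch below is a minimal, hypothetical illustration of that loop; the agent and verifier internals are stand-ins, not the paper's actual components.

```python
# Hypothetical sketch of an iterative generate-and-verify loop for producing
# numerically grounded reasoning traces. All function internals are stand-ins.

def generate_trace(series, feedback=None):
    # Stand-in "reasoner" agent: states the trend direction with its evidence.
    delta = series[-1] - series[0]
    claim = "increasing" if delta > 0 else "decreasing"
    return {"claim": claim, "delta": delta}

def verify_trace(series, trace):
    # Stand-in verifier: checks the claim against the numbers it cites.
    actual = series[-1] - series[0]
    ok = (trace["claim"] == "increasing") == (actual > 0)
    return ok, None if ok else "claim contradicts the series"

def synthesize_trace(series, max_rounds=3):
    # Iterate until the verifier accepts the trace or rounds run out.
    feedback = None
    for _ in range(max_rounds):
        trace = generate_trace(series, feedback)
        ok, feedback = verify_trace(series, trace)
        if ok:
            return trace
    raise RuntimeError("no grounded trace after max_rounds")

print(synthesize_trace([1.0, 2.5, 4.0]))
# → {'claim': 'increasing', 'delta': 3.0}
```

The verification step is what makes the traces "numerically grounded": a draft whose claims contradict the series is rejected and regenerated rather than emitted.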

Abstract

We introduce TFRBench, the first benchmark designed to evaluate the reasoning capabilities of forecasting systems. Traditionally, time-series forecasting has been evaluated solely on numerical accuracy, treating foundation models as "black boxes." Unlike existing benchmarks, TFRBench provides a protocol for evaluating the reasoning generated by forecasting systems, specifically their analysis of cross-channel dependencies, trends, and external events. To enable this, we propose a systematic multi-agent framework that uses an iterative verification loop to synthesize numerically grounded reasoning traces. Spanning ten datasets across five domains, our evaluation confirms that this reasoning is causally effective and useful for evaluation: prompting LLMs with our generated traces significantly improves forecasting accuracy compared to direct numerical prediction (e.g., avg. ~40.2% → 56.6%), validating the quality of our reasoning. Conversely, benchmarking experiments reveal that off-the-shelf LLMs consistently struggle with both reasoning (lower LLM-as-a-Judge scores) and numerical forecasting, frequently failing to capture domain-specific dynamics. TFRBench thus establishes a new standard for interpretable, reasoning-based evaluation in time-series forecasting. Our benchmark is available at: https://tfrbench.github.io
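The accuracy comparison the abstract reports (direct prediction vs. trace-conditioned prediction) amounts to scoring the same forecaster with and without a reasoning trace in its input. The toy sketch below illustrates that comparison shape only; the `forecast` stand-in and the 10% hit-rate metric are assumptions for illustration, not the paper's model or metric.

```python
# Hedged sketch of a with/without-trace forecasting comparison.
# `forecast` is a toy stand-in for an LLM forecaster, not the paper's system.

def forecast(history, trace=None):
    # Toy model: naive last-value forecast, extrapolated when the
    # (hypothetical) trace asserts an upward trend.
    pred = history[-1]
    if trace and trace.get("trend") == "up":
        pred += history[-1] - history[-2]
    return pred

def hit_rate(cases, use_traces):
    # Fraction of forecasts within 10% of the true next value.
    hits = 0
    for history, trace, truth in cases:
        pred = forecast(history, trace if use_traces else None)
        hits += abs(pred - truth) <= 0.1 * abs(truth)
    return hits / len(cases)

cases = [
    ([1.0, 2.0, 3.0], {"trend": "up"}, 4.0),
    ([5.0, 6.0, 7.0], {"trend": "up"}, 8.0),
]
print(hit_rate(cases, use_traces=False), hit_rate(cases, use_traces=True))
# → 0.0 1.0
```

On these toy cases the trace-conditioned forecaster scores strictly higher, mirroring (in shape only, not magnitude) the reported accuracy gain from prompting with the generated traces.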