QuitoBench: A High-Quality Open Time Series Forecasting Benchmark

arXiv cs.LG / March 30, 2026


Key Points

  • The paper introduces QuitoBench, a regime-balanced open benchmark for time series forecasting covering eight trend×seasonality×forecastability (TSF) regimes, which capture forecasting-relevant properties better than domain labels do.
  • Using Quito, a billion-scale time series traffic corpus from Alipay across nine business domains, the authors evaluate 10 forecasting models (including deep learning, foundation models, and statistical baselines) on 232,200 evaluation instances.
  • Results show a context-length crossover: deep learning models outperform at short context lengths (L=96) while foundation models lead at long contexts (L≥576).
  • Forecastability is identified as the dominant difficulty factor, yielding a 3.64× MAE gap across regimes, and deep learning models achieve similar or better performance than foundation models with 59× fewer parameters.
  • The study finds that increasing training data produces larger gains than increasing model size for both deep learning and foundation model families, and the authors release the benchmark for reproducible, regime-aware research.
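The eight TSF regimes come from crossing three binary series properties: trend (present/absent), seasonality (present/absent), and forecastability (high/low). The paper's exact regime definitions are not given here, so the proxy statistics and thresholds below (linear-fit R² for trend, lag-period autocorrelation for seasonality, spectral entropy for forecastability) are illustrative assumptions, not the benchmark's actual criteria:

```python
import numpy as np

def classify_regime(series, period=24, trend_thr=0.3, season_thr=0.3, fcast_thr=0.5):
    """Assign a series to one of 2x2x2 = 8 trend/seasonality/forecastability
    regimes. Proxies and thresholds are illustrative assumptions only."""
    x = np.asarray(series, dtype=float)
    t = np.arange(len(x))

    # Trend proxy: R^2 of an ordinary least-squares linear fit.
    slope, intercept = np.polyfit(t, x, 1)
    resid = x - (slope * t + intercept)
    r2 = 1.0 - resid.var() / x.var()
    has_trend = bool(r2 > trend_thr)

    # Seasonality proxy: autocorrelation at the candidate seasonal lag.
    xc = x - x.mean()
    acf = np.correlate(xc, xc, mode="full")[len(x) - 1 + period] / (xc @ xc)
    has_season = bool(acf > season_thr)

    # Forecastability proxy: 1 - normalized spectral entropy
    # (a flat spectrum, i.e. white noise, gives entropy near 1).
    psd = np.abs(np.fft.rfft(xc)) ** 2
    p = psd / psd.sum()
    entropy = -(p * np.log(p + 1e-12)).sum() / np.log(len(p))
    forecastable = bool((1.0 - entropy) > fcast_thr)

    return (has_trend, has_season, forecastable)
```

A pure period-24 sine lands in the (no trend, seasonal, forecastable) regime, while white noise falls into (no trend, no seasonality, low forecastability), the regime the paper identifies as hardest.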

Abstract

Time series forecasting is critical across finance, healthcare, and cloud computing, yet progress is constrained by a fundamental bottleneck: the scarcity of large-scale, high-quality benchmarks. To address this gap, we introduce QuitoBench, a regime-balanced benchmark for time series forecasting with coverage across eight trend×seasonality×forecastability (TSF) regimes, designed to capture forecasting-relevant properties rather than application-defined domain labels. The benchmark is built upon Quito, a billion-scale time series corpus of application traffic from Alipay spanning nine business domains. Benchmarking 10 models from deep learning, foundation models, and statistical baselines across 232,200 evaluation instances, we report four key findings: (i) a context-length crossover where deep learning models lead at short context (L = 96) but foundation models dominate at long context (L ≥ 576); (ii) forecastability is the dominant difficulty driver, producing a 3.64× MAE gap across regimes; (iii) deep learning models match or surpass foundation models with 59× fewer parameters; and (iv) scaling the amount of training data provides substantially greater benefit than scaling model size for both model families. These findings are validated by strong cross-benchmark and cross-metric consistency. Our open-source release enables reproducible, regime-aware evaluation for time series forecasting research.
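The reported 3.64× gap is a ratio of per-regime error scores. As a minimal sketch of how such a gap is computed, the snippet below uses MAE (mean absolute error) with hypothetical per-regime values, not the paper's actual numbers:

```python
import numpy as np

def mae(y_true, y_pred):
    # Mean absolute error between a forecast and the ground truth.
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

# Hypothetical per-regime MAE scores (illustrative only). The paper's 3.64x
# figure is this kind of ratio between the hardest and easiest regimes.
regime_mae = {"high_forecastability": 0.10, "low_forecastability": 0.35}
gap = max(regime_mae.values()) / min(regime_mae.values())  # 3.5x in this toy case
```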