Why Model Selection Fails in Time Series Forecasting: An Empirical Study of Instability Across Data Regimes

arXiv stat.ML / 5/5/2026

Key Points

  • The study finds that time series forecasting model selection often fails to generalize across datasets with different statistical and structural “data regimes.”
  • It proposes a descriptor-based framework that characterizes regimes through measurable properties such as trend strength, seasonality, noise level, and temporal dependence (see the first sketch after this list).
  • A rule-based mechanism then maps these descriptors to candidate forecasting models (second sketch below), but it yields low accuracy and only rarely identifies the empirically optimal model.
  • The researchers observe strong ranking instability across different dataset characteristics and forecasting horizons, especially in noisy and mixed regimes.
  • Overall, the paper argues that static, heuristic, descriptor-driven selection cannot reliably predict forecasting performance and that more adaptive, data-driven strategies are needed.
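
The paper itself ships no code, but the four descriptors map naturally onto STL-based strength measures that are common in the forecasting literature. A minimal Python sketch, assuming pandas/statsmodels; the helper name and the Wang-Smith-Hyndman strength formulas are assumptions here, and the paper's exact definitions may differ:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import STL

def regime_descriptors(y: pd.Series, period: int) -> dict:
    """STL-based data-regime descriptors for a univariate series.

    Assumed formulas (standard decomposition-based strength measures);
    the paper's exact definitions may differ. Expects a numeric,
    NaN-free series with at least two full seasonal periods.
    """
    fit = STL(y, period=period).fit()
    resid_var = np.var(fit.resid)
    return {
        # 1 when the detrended series is pure trend, 0 when no trend signal
        "trend": max(0.0, 1.0 - resid_var / np.var(fit.trend + fit.resid)),
        # analogous strength measure for the seasonal component
        "seasonality": max(0.0, 1.0 - resid_var / np.var(fit.seasonal + fit.resid)),
        # share of the series' variation left unexplained by the decomposition
        "noise": float(np.std(fit.resid) / np.std(y)),
        # lag-1 autocorrelation as a temporal-dependence proxy
        "dependence": float(y.autocorr(lag=1)),
    }
```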

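The rule-based selector is then just a static lookup over those descriptors. The thresholds and candidate pool below are illustrative assumptions, not the paper's published rule set; the point is that any mapping of this form is what the study stress-tests:

```python
def recommend_model(d: dict) -> str:
    """Static descriptor-to-model rules of the kind the paper critiques.

    Thresholds and candidate models are illustrative assumptions,
    not the paper's published rule set.
    """
    if d["seasonality"] > 0.6 and d["trend"] > 0.6:
        return "ETS(A,A,A)"    # Holt-Winters: additive trend and seasonality
    if d["seasonality"] > 0.6:
        return "SARIMA"
    if d["trend"] > 0.6:
        return "ETS(A,A,N)"    # Holt's linear trend, no seasonality
    if d["noise"] > 0.7 or d["dependence"] < 0.2:
        return "naive"         # little exploitable structure left
    return "ARIMA"             # autocorrelated but otherwise weak structure
```

The study's headline negative result is that mappings of exactly this shape only rarely pick the model that wins empirically.
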
Abstract

Time series forecasting models often exhibit inconsistent performance across datasets with varying statistical and structural properties. Despite the wide range of available forecasting techniques, it remains unclear whether model selection can be reliably guided by simple data characteristics. This paper investigates why rule-based model selection fails in time series forecasting by analyzing the relationship between data-regime descriptors and model performance. A descriptor-based framework is introduced to characterize time series using measurable properties, including trend strength, seasonality, noise level, and temporal dependence. Based on these descriptors, a rule-based selection mechanism is formulated to map data regimes to candidate forecasting models. The approach is evaluated on multiple real-world datasets across different domains and forecasting horizons. The results show that rule-based model selection achieves low accuracy, with correct model identification occurring in only a small fraction of cases. Significant discrepancies are observed between recommended and empirically optimal models, particularly in noisy and mixed regimes. Further analysis reveals that model performance is highly sensitive to both dataset characteristics and forecasting horizon, resulting in substantial ranking instability across scenarios. These findings explain why simple heuristic rules fail to generalize and demonstrate that forecasting performance cannot be reliably predicted using static, descriptor-based approaches. This study provides empirical evidence that model selection in time series forecasting is inherently context-dependent and highlights the need for more adaptive, data-driven strategies.
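
The ranking-instability claim also lends itself to a concrete check: score every model at every horizon, then measure how well the induced rankings agree across horizon pairs. A sketch using Kendall's tau as the agreement measure (an assumed metric; the paper may quantify instability differently):

```python
import numpy as np
from itertools import combinations
from scipy.stats import kendalltau

def ranking_instability(errors):
    """Mean pairwise Kendall's tau between model rankings across horizons.

    `errors[model][horizon]` holds a forecast error such as MASE.
    Values near 1 mean stable rankings; values near 0 (or negative)
    mean unstable ones. Illustrative metric only.
    """
    models = sorted(errors)
    horizons = sorted(next(iter(errors.values())))
    taus = []
    for h1, h2 in combinations(horizons, 2):
        # Kendall's tau on the raw error vectors equals tau on the
        # rankings they induce, so no explicit ranking step is needed.
        tau, _ = kendalltau([errors[m][h1] for m in models],
                            [errors[m][h2] for m in models])
        taus.append(tau)
    return float(np.mean(taus))
```

On a benchmark with stable rankings the mean tau sits near 1; values collapsing toward zero in noisy or mixed regimes would reproduce the instability the paper reports.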