Training-Free Probabilistic Time-Series Forecasting with Conformal Seasonal Pools

arXiv stat.ML / 5/6/2026


Key Points

  • The paper introduces Conformal Seasonal Pools (CSP), a training-free probabilistic time-series forecaster that generates uncertainty by combining same-season empirical samples with conformal signed-residual samples around a seasonal naive forecast.
  • On an audited rolling-origin benchmark using the six datasets where DeepNPTS was originally evaluated, CSP-Adaptive improves on every reported metric versus DeepNPTS, including CRPS, normalized mean quantile loss, and empirical 95% coverage.
  • CSP reports substantially better calibration, with mean empirical 95% coverage of 0.89 versus 0.66 for DeepNPTS; paired Wilcoxon tests yield p-values ranging from roughly 10⁻¹⁰ to 10⁻⁴⁵ across the reported metrics.
  • The authors highlight that DeepNPTS can fail more severely than overall coverage suggests: in the worst 10% of evaluation windows, its prediction intervals miss the truth across all multi-step horizons simultaneously.
  • CSP runs over 500× faster on CPU, and the authors argue that training-free conformal samplers should be mandatory baselines when evaluating learned non-parametric forecasters, especially in decision-critical settings.
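The sampling idea in the first bullet can be sketched in a few lines. The following is an illustrative reconstruction under assumptions (one-step-ahead forecasting, a fixed mixing weight, plain seasonal-naive residuals as the conformity scores), not the paper's actual implementation:

```python
import numpy as np

def csp_sample(y, season, n_samples=1000, mix=0.5, seed=0):
    """Sketch of a CSP-style one-step-ahead sampler (illustrative only).

    Combines two sample pools:
      (a) same-season empirical draws: past values at the same seasonal phase
          as the step being forecast;
      (b) signed-residual draws: seasonal-naive residuals added back onto the
          seasonal naive point forecast for the next step.
    `mix` is an assumed fixed mixing weight between the two pools.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    t = len(y)
    phase = t % season                    # seasonal phase of the next step
    pool = y[phase::season]               # same-season historical values
    naive = y[-season]                    # seasonal naive forecast for step t
    resid = y[season:] - y[:-season]      # signed seasonal-naive residuals
    use_pool = rng.random(n_samples) < mix
    return np.where(
        use_pool,
        rng.choice(pool, size=n_samples),
        naive + rng.choice(resid, size=n_samples),
    )
```

Prediction intervals then come from sample quantiles, e.g. `np.quantile(samples, [0.025, 0.975])` for a nominal 95% interval.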

Abstract

We propose Conformal Seasonal Pools (CSP), a training-free probabilistic time-series forecaster that mixes same-season empirical draws with signed-residual draws around a seasonal naive forecast. In an audited rolling-origin benchmark on the six time-series datasets where DeepNPTS was originally evaluated (electricity, exchange_rate, solar_energy, taxi, traffic, wikipedia), CSP-Adaptive significantly outperforms DeepNPTS on every metric we report: CRPS (per-window paired Wilcoxon p ≈ 4×10⁻¹⁰), normalized mean quantile loss (p ≈ 7×10⁻¹⁰), and empirical 95% coverage (p ≈ 8×10⁻⁴⁵; mean 0.89 vs. 0.66), while running over 500× faster on CPU. Coverage is the most decision-critical of these: a nominal 0.95 interval that contains the truth in only about 66% of cases fails the basic calibration desideratum and would not survive deployment in safety- or decision-critical settings. The failure mode is also more severe than aggregate coverage suggests: in the worst 10% of windows, DeepNPTS's prediction interval covers none of the H forecast horizons; the entire multi-step trajectory misses the truth at every step simultaneously. This poses serious risk in safety- and decision-critical applications such as healthcare, finance, energy operations, and autonomous systems, where prediction intervals that systematically miss the truth across the entire planning horizon translate directly into misclassified patients, regulatory capital failures, grid imbalances, and safety-case violations. CSP achieves all of this with no learned parameters and no training. We argue that training-free conformal samplers should be mandatory baselines when evaluating learned non-parametric forecasters.
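The two headline metrics, empirical interval coverage and CRPS, are straightforward to compute from forecast samples. Below is a generic sketch using standard sample-based estimators; it is not the paper's evaluation code:

```python
import numpy as np

def empirical_coverage(samples, y_true, level=0.95):
    """Fraction of true values falling inside the central `level` interval
    implied by forecast samples.

    samples: array of shape (n_windows, n_samples); y_true: shape (n_windows,).
    """
    lo = np.quantile(samples, (1 - level) / 2, axis=1)
    hi = np.quantile(samples, (1 + level) / 2, axis=1)
    return float(np.mean((y_true >= lo) & (y_true <= hi)))

def sample_crps(samples, y):
    """Sample-based CRPS estimate for one target value y:
    E|X - y| - 0.5 * E|X - X'| over forecast samples X, X'."""
    s = np.asarray(samples, dtype=float)
    return float(np.mean(np.abs(s - y))
                 - 0.5 * np.mean(np.abs(s[:, None] - s[None, :])))
```

A nominal 0.95 interval whose `empirical_coverage` comes out near 0.66, as reported for DeepNPTS, is exactly the calibration failure the abstract emphasizes.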