Noise Titration: Exact Distributional Benchmarking for Probabilistic Time Series Forecasting

arXiv stat.ML / 3/24/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper argues that conventional time-series forecasting evaluation is largely unfalsifiable because it relies on passive observation of single historical trajectories rather than controllable perturbations.
  • It proposes “Noise Titration,” an interventionist benchmarking method that injects calibrated Gaussian observation noise into systems with known dynamics, enabling exact distributional evaluation via exact negative log-likelihoods and calibrated distributional tests (a minimal sketch of the recipe follows this list).
  • The work extends the Fern architecture into a probabilistic generative model that directly parameterizes covariance structures on the Symmetric Positive Definite (SPD) cone to improve calibrated joint forecasting without expensive Jacobian modeling.
  • Experiments suggest that state-of-the-art zero-shot foundation models fail under non-stationary regime shifts and elevated noise, consistent with a context-parroting mechanism, while Fern better preserves the invariant measure and multivariate geometry of the dynamics, yielding sharper and better-calibrated forecasts.

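To make the recipe concrete, here is a minimal sketch of the noise-titration setup, not the paper's code: integrate a known chaotic system, add Gaussian observation noise at several known variances, and score forecasts with the exact Gaussian negative log-likelihood. The Lorenz-63 system, RK4 integrator, noise grid, and oracle baseline below are illustrative assumptions.

```python
# Minimal noise-titration sketch: known dynamics + known Gaussian observation
# noise means the negative log-likelihood (NLL) of a forecast is exact, and the
# oracle that knows the latent state gives an exact reference floor.
import numpy as np

def lorenz63(x, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Lorenz-63 vector field (a standard chaotic benchmark system)."""
    return np.array([
        sigma * (x[1] - x[0]),
        x[0] * (rho - x[2]) - x[1],
        x[0] * x[1] - beta * x[2],
    ])

def simulate(x0, n_steps, dt=0.01):
    """Integrate the latent trajectory with a simple RK4 scheme."""
    xs = [np.asarray(x0, dtype=float)]
    for _ in range(n_steps):
        x = xs[-1]
        k1 = lorenz63(x)
        k2 = lorenz63(x + 0.5 * dt * k1)
        k3 = lorenz63(x + 0.5 * dt * k2)
        k4 = lorenz63(x + dt * k3)
        xs.append(x + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0)
    return np.stack(xs)

def gaussian_nll(y, mean, var):
    """Exact per-step NLL of an independent Gaussian forecast."""
    return 0.5 * np.sum(np.log(2 * np.pi * var) + (y - mean) ** 2 / var, axis=-1)

rng = np.random.default_rng(0)
latent = simulate(x0=[1.0, 1.0, 1.0], n_steps=2000)

# "Titrate" observation noise: sweep a grid of known noise variances.
for noise_std in (0.1, 0.5, 1.0, 2.0):
    observed = latent + rng.normal(scale=noise_std, size=latent.shape)
    # Oracle forecast: the true latent state plus the known injected variance.
    # Any probabilistic forecaster's NLL can be compared to this exact floor.
    oracle_nll = gaussian_nll(observed, latent, noise_std ** 2).mean()
    print(f"noise_std={noise_std:>4}: oracle NLL per step = {oracle_nll:.3f}")
```
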
Abstract

Modern time series forecasting is evaluated almost entirely through passive observation of single historical trajectories, rendering claims about a model's robustness to non-stationarity fundamentally unfalsifiable. We propose a paradigm shift toward interventionist, exact-statistical benchmarking. By systematically titrating calibrated Gaussian observation noise into known chaotic and stochastic dynamical systems, we transform forecasting from a black-box sequence matching game into an exact distributional inference task. Because the underlying data-generating process and noise variance are mathematically explicit, evaluation can rely on exact negative log-likelihoods and calibrated distributional tests rather than heuristic approximations. To fully leverage this framework, we extend the Fern architecture into a probabilistic generative model that natively parameterizes the Symmetric Positive Definite (SPD) cone, outputting calibrated joint covariance structures without the computational bottleneck of generic Jacobian modeling. Under this rigorous evaluation, we find that state-of-the-art zero-shot foundation models behave consistently with the context-parroting mechanism, failing systematically under non-stationary regime shifts and elevated noise. In contrast, Fern explicitly captures the invariant measure and multivariate geometry of the underlying dynamics, maintaining structural fidelity and statistically sharp calibration precisely where massive sequence-matching models collapse.
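
The summary does not spell out Fern's exact SPD parameterization, so the following is only one common way to realize "covariances on the SPD cone": a head that predicts a lower-triangular Cholesky factor with a softplus-positive diagonal, which guarantees a valid joint Gaussian and an exact log-likelihood without any Jacobian computation. The class and parameter names (GaussianCovarianceHead, feature_dim, target_dim) are illustrative, not from the paper.

```python
# Sketch of an SPD covariance head: predict mean plus a lower-triangular
# Cholesky factor L with a strictly positive diagonal, so that L @ L.T is
# always symmetric positive definite and the joint Gaussian NLL is exact.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianCovarianceHead(nn.Module):
    """Maps a feature vector to a D-dimensional Gaussian with full covariance."""

    def __init__(self, feature_dim: int, target_dim: int):
        super().__init__()
        self.target_dim = target_dim
        n_tril = target_dim * (target_dim + 1) // 2  # entries of the lower triangle
        self.mean_head = nn.Linear(feature_dim, target_dim)
        self.tril_head = nn.Linear(feature_dim, n_tril)
        # Row/column indices of the lower triangle, reused at every forward pass.
        self.register_buffer("tril_idx", torch.tril_indices(target_dim, target_dim))

    def forward(self, features: torch.Tensor) -> torch.distributions.MultivariateNormal:
        mean = self.mean_head(features)
        raw = self.tril_head(features)
        batch = features.shape[:-1]
        L = torch.zeros(*batch, self.target_dim, self.target_dim, device=features.device)
        L[..., self.tril_idx[0], self.tril_idx[1]] = raw
        # Softplus keeps the diagonal strictly positive, so L @ L.T lies on the SPD cone.
        diag = torch.arange(self.target_dim)
        L[..., diag, diag] = F.softplus(L[..., diag, diag]) + 1e-5
        return torch.distributions.MultivariateNormal(loc=mean, scale_tril=L)

# Usage: exact joint NLL as a training or evaluation loss.
head = GaussianCovarianceHead(feature_dim=64, target_dim=3)
features = torch.randn(8, 64)   # e.g. encoder output for 8 series
targets = torch.randn(8, 3)     # next-step multivariate observations
nll = -head(features).log_prob(targets).mean()
print(float(nll))
```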