Integrating Inductive Biases in Transformers via Distillation for Financial Time Series Forecasting

arXiv cs.LG / 3/19/2026

Key Points

  • The paper proposes TIPS (Transformer with Inductive Prior Synthesis), a knowledge-distillation framework that blends causality, locality, and periodicity biases inside a Transformer to improve forecasting of non-stationary financial time series.
  • TIPS trains bias-specialized Transformer teachers via attention masking and distills their knowledge into a single student model with regime-dependent alignment across biases.
  • Across four major equity markets, TIPS achieves state-of-the-art performance, outperforming strong ensemble baselines by 55%, 9%, and 16% in annual return, Sharpe ratio, and Calmar ratio, respectively, while requiring only 38% of the inference-time computation.
  • The results suggest that drawing on different inductive biases in different market regimes is important for robust generalization in non-stationary financial time series.
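
To make the idea of "bias-specialized teachers via attention masking" concrete, here is a minimal sketch of how causality, locality, and periodicity could each be expressed as a boolean attention mask. The function names, the `window` and `period` parameters, and the exact mask shapes are illustrative assumptions; the summary does not specify how the paper constructs its masks.

```python
from typing import List

# mask[i][j] is True when query position i may attend to key position j.
Mask = List[List[bool]]


def causal_mask(n: int) -> Mask:
    """Causality (assumed form): position i sees only positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def local_mask(n: int, window: int) -> Mask:
    """Locality (assumed form): position i sees positions within `window` steps."""
    return [[abs(i - j) <= window for j in range(n)] for i in range(n)]


def periodic_mask(n: int, period: int) -> Mask:
    """Periodicity (assumed form): position i sees positions whose distance
    from i is a whole multiple of `period` (including itself)."""
    return [[(i - j) % period == 0 for j in range(n)] for i in range(n)]


def apply_mask(scores: List[List[float]], mask: Mask) -> List[List[float]]:
    """Replace disallowed attention scores with -inf, so that a subsequent
    row-wise softmax assigns them zero weight."""
    neg_inf = float("-inf")
    return [
        [s if allowed else neg_inf for s, allowed in zip(row, mask_row)]
        for row, mask_row in zip(scores, mask)
    ]
```

In this sketch every mask keeps the diagonal, so each row retains at least one finite score and the softmax stays well defined. Under these assumptions, each teacher would be an ordinary Transformer trained with one of the masks applied to its attention scores.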

Abstract

Transformer-based models have been widely adopted for time-series forecasting due to their high representational capacity and architectural flexibility. However, many Transformer variants implicitly assume stationarity and stable temporal dynamics -- assumptions routinely violated in financial markets characterized by regime shifts and non-stationarity. Empirically, state-of-the-art time-series Transformers often underperform even vanilla Transformers on financial tasks, while simpler architectures with distinct inductive biases, such as CNNs and RNNs, can achieve stronger performance with substantially lower complexity. At the same time, no single inductive bias dominates across markets or regimes, suggesting that robust financial forecasting requires integrating complementary temporal priors. We propose TIPS (Transformer with Inductive Prior Synthesis), a knowledge distillation framework that synthesizes diverse inductive biases -- causality, locality, and periodicity -- within a unified Transformer. TIPS trains bias-specialized Transformer teachers via attention masking, then distills their knowledge into a single student model with regime-dependent alignment across inductive biases. Across four major equity markets, TIPS achieves state-of-the-art performance, outperforming strong ensemble baselines by 55%, 9%, and 16% in annual return, Sharpe ratio, and Calmar ratio, while requiring only 38% of the inference-time computation. Further analyses show that TIPS generates statistically significant excess returns beyond both vanilla Transformers and its teacher ensembles, and exhibits regime-dependent behavioral alignment with classical architectures during their profitable periods. These results highlight the importance of regime-dependent inductive bias utilization for robust generalization in non-stationary financial time series.
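
For readers less familiar with the three reported metrics, the sketch below computes annual return, Sharpe ratio, and Calmar ratio from a series of per-period simple returns using common textbook definitions. The 252-period year, the per-period `risk_free` rate, and the use of sample standard deviation are assumptions; the paper's exact conventions may differ.

```python
import math
from typing import Sequence

TRADING_DAYS = 252  # assumed number of return periods per year


def annual_return(returns: Sequence[float], periods: int = TRADING_DAYS) -> float:
    """Geometric annualized return from per-period simple returns."""
    growth = 1.0
    for r in returns:
        growth *= 1.0 + r
    return growth ** (periods / len(returns)) - 1.0


def sharpe_ratio(returns: Sequence[float], periods: int = TRADING_DAYS,
                 risk_free: float = 0.0) -> float:
    """Mean per-period excess return over its sample standard deviation,
    scaled by sqrt(periods). Needs at least two non-identical returns."""
    excess = [r - risk_free for r in returns]
    mean = sum(excess) / len(excess)
    var = sum((x - mean) ** 2 for x in excess) / (len(excess) - 1)
    return mean / math.sqrt(var) * math.sqrt(periods)


def max_drawdown(returns: Sequence[float]) -> float:
    """Largest peak-to-trough loss of the cumulative equity curve,
    returned as a positive fraction (0.5 means a 50% loss from the peak)."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, 1.0 - equity / peak)
    return worst


def calmar_ratio(returns: Sequence[float], periods: int = TRADING_DAYS) -> float:
    """Annualized return divided by maximum drawdown.
    Undefined (division by zero) if the series never draws down."""
    return annual_return(returns, periods) / max_drawdown(returns)
```

The Sharpe ratio penalizes overall volatility, while the Calmar ratio penalizes the single worst drawdown, so the two can rank strategies differently.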