TriTS: Time Series Forecasting from a Multimodal Perspective

arXiv cs.CV / 4/21/2026

📰 News · Models & Research

Key Points

  • The paper introduces TriTS, a multimodal framework for long-term time series forecasting that transforms 1D signals into orthogonal time, frequency, and 2D-vision representations.
  • It addresses the 1D-to-2D representation bottleneck using Period-Aware Reshaping and a Visual Mamba (Vim) component to capture cross-period dependencies with linear computational complexity.
  • For the frequency branch, it proposes Multi-Resolution Wavelet Mixing (MR-WM) to explicitly disentangle non-stationary signals into trend and noise and improve time-frequency localization.
  • TriTS keeps a streaming linear time-domain branch for numerical stability and fuses the three modalities dynamically to handle diverse data contexts.
  • Experiments on multiple benchmark datasets show state-of-the-art forecasting performance, with reduced parameter counts and lower inference latency compared with prior vision-based forecasters.
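The 1D-to-2D folding described above can be sketched in a few lines. This is a hypothetical illustration of the Period-Aware Reshaping idea (in the spirit of FFT-based period folding used by prior work such as TimesNet); the paper's exact procedure, padding scheme, and the `period_aware_reshape` name are assumptions, not the authors' implementation.

```python
import numpy as np

def period_aware_reshape(x):
    """Fold a 1D series into a 2D 'image' whose rows align with the
    dominant period, estimated from the FFT amplitude spectrum.
    Hypothetical sketch -- not the paper's exact method."""
    n = len(x)
    spec = np.abs(np.fft.rfft(x))
    spec[0] = 0.0                         # ignore the DC component
    freq = max(int(np.argmax(spec)), 1)   # dominant frequency index
    period = max(1, n // freq)
    rows = int(np.ceil(n / period))
    padded = np.pad(x, (0, rows * period - n))  # zero-pad to a full grid
    # rows index successive periods; columns index phase within a period,
    # so cross-period dependencies become vertical "textures" in the image.
    return padded.reshape(rows, period)

# Example: a pure sine with period 24 folds into four aligned rows.
t = np.arange(96)
img = period_aware_reshape(np.sin(2 * np.pi * t / 24))
```

A 2D backbone (here, the Visual Mamba component) can then scan this image with linear complexity, instead of paying the quadratic attention cost a ViT would incur over the same patches.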

Abstract

Time series forecasting plays a pivotal role in critical sectors such as finance, energy, transportation, and meteorology. However, Long-term Time Series Forecasting (LTSF) remains a significant challenge because real-world signals contain highly entangled temporal dynamics that are difficult to fully capture from a purely 1D perspective. To break this representation bottleneck, we propose TriTS, a novel cross-modal disentanglement framework that projects 1D time series into orthogonal time, frequency, and 2D-vision spaces. To seamlessly bridge the 1D-to-2D modality gap without the prohibitive O(N^2) computational overhead of Vision Transformers (ViTs), we introduce a Period-Aware Reshaping strategy and incorporate Visual Mamba (Vim). This approach efficiently models cross-period dependencies as global visual textures while maintaining linear computational complexity. Complementing this, we design a Multi-Resolution Wavelet Mixing (MR-WM) module for the frequency modality, which explicitly decouples non-stationary signals into trend and noise components to achieve fine-grained time-frequency localization. Finally, a streaming linear branch is retained in the time domain to anchor numerical stability. By dynamically fusing these three complementary representations, TriTS effectively adapts to diverse data contexts. Extensive experiments across multiple benchmark datasets demonstrate that TriTS achieves state-of-the-art (SOTA) performance, outperforming existing vision-based forecasters while drastically reducing both parameter count and inference latency.
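The trend/noise decoupling performed by the frequency branch can be illustrated with a multilevel Haar wavelet transform: each level splits the signal into a coarser trend (low-pass) and a detail (high-pass) band, giving localization in both time and frequency. This is a minimal sketch of the general multi-resolution idea; the paper's MR-WM module, its wavelet family, and the mixing network are not specified here, and the function names below are illustrative.

```python
import numpy as np

def haar_dwt(x):
    """One level of the orthonormal Haar DWT: returns the half-resolution
    approximation (trend) and detail (noise-like) coefficients."""
    x = np.asarray(x, dtype=float)
    if len(x) % 2:                       # pad odd lengths by repeating the edge
        x = np.append(x, x[-1])
    pairs = x.reshape(-1, 2)
    trend = pairs.sum(axis=1) / np.sqrt(2.0)              # low-pass averages
    detail = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2.0)   # high-pass differences
    return trend, detail

def multires_decompose(x, levels=3):
    """Recursively split a signal into per-scale detail bands plus a final
    coarse trend -- a sketch of multi-resolution disentanglement, not the
    paper's actual MR-WM implementation."""
    details = []
    trend = np.asarray(x, dtype=float)
    for _ in range(levels):
        trend, d = haar_dwt(trend)
        details.append(d)
    return trend, details

x = np.arange(16, dtype=float)
trend, details = multires_decompose(x, levels=3)
```

Because the Haar transform is orthonormal, the decomposition preserves the signal's energy exactly, so the trend and detail bands together lose no information; a learned mixer can then reweight the bands before reconstruction.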