FAST: A Synergistic Framework of Attention and State-space Models for Spatiotemporal Traffic Prediction

arXiv cs.LG / 4/16/2026


Key Points

  • The paper introduces FAST, a unified spatiotemporal traffic forecasting framework that combines attention mechanisms for temporal patterns with state-space (Mamba-based) modeling for efficient spatial dependencies across sensor networks.
  • FAST uses a Temporal–Spatial–Temporal architecture, where temporal attention captures both short- and long-term dynamics while the spatial module models inter-sensor relationships with linear complexity.
  • To handle heterogeneous traffic contexts, FAST adds a learnable multi-source spatiotemporal embedding that fuses historical flow, temporal context, and node-level information.
  • The model also employs a multi-level skip prediction mechanism to enable hierarchical feature fusion for improved representation learning.
  • Experiments on PeMS04/07/08 show FAST outperforms strong Transformer, GNN, attention, and Mamba baselines, achieving up to 4.3% lower RMSE and 2.8% lower MAE, indicating a strong accuracy–scalability trade-off.
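The Temporal–Spatial–Temporal pipeline above can be sketched in miniature. This is not the paper's implementation: the layer shapes, weight initialization, and the simple linear recurrence standing in for the Mamba-based spatial module are all assumptions; it only illustrates the data flow (attention over time, a linear-complexity scan over sensors, attention over time again).

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def temporal_attention(x, Wq, Wk, Wv):
    # x: (N, T, d) -- self-attention over the time axis, independently per sensor.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])  # (N, T, T)
    return softmax(scores, axis=-1) @ v

def spatial_ssm(x, A, B, C):
    # x: (N, T, d) -- a plain linear state-space recurrence scanned over the
    # sensor axis, a stand-in for the selective (Mamba-style) spatial module.
    # One step per sensor, so cost is linear in N (vs. quadratic attention).
    h = np.zeros_like(x[0])          # (T, d) hidden state
    out = np.empty_like(x)
    for n in range(x.shape[0]):
        h = h @ A + x[n] @ B
        out[n] = h @ C
    return out

N, T, d = 8, 12, 16                  # toy sizes: sensors, timesteps, hidden dim
x = rng.standard_normal((N, T, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
A, B, C = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))

h1 = temporal_attention(x, Wq, Wk, Wv)   # temporal stage 1
h2 = spatial_ssm(h1, A, B, C)            # spatial stage (linear in N)
y = temporal_attention(h2, Wq, Wk, Wv)   # temporal stage 2
print(y.shape)
```

The point of the sandwich ordering is that each sensor's series is contextualized in time before and after spatial mixing, while the only cross-sensor operation is the linear-cost scan.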

Abstract

Traffic forecasting requires modeling complex temporal dynamics and long-range spatial dependencies over large sensor networks. Existing methods typically face a trade-off between expressiveness and efficiency: Transformer-based models capture global dependencies well but suffer from quadratic complexity, while recent selective state-space models are computationally efficient yet less effective at modeling spatial interactions in graph-structured traffic data. We propose FAST, a unified framework that combines attention and state-space modeling for scalable spatiotemporal traffic forecasting. FAST adopts a Temporal–Spatial–Temporal architecture, where temporal attention modules capture both short- and long-term temporal patterns, and a Mamba-based spatial module models long-range inter-sensor dependencies with linear complexity. To better represent heterogeneous traffic contexts, FAST further introduces a learnable multi-source spatiotemporal embedding that integrates historical traffic flow, temporal context, and node-level information, together with a multi-level skip prediction mechanism for hierarchical feature fusion. Experiments on PeMS04, PeMS07, and PeMS08 show that FAST consistently outperforms strong baselines from Transformer-, GNN-, attention-, and Mamba-based families. In particular, FAST achieves the best MAE and RMSE on all three benchmarks, with up to 4.3% lower RMSE and 2.8% lower MAE than the strongest baseline, demonstrating a favorable balance between accuracy, scalability, and generalization.
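The multi-source spatiotemporal embedding described in the abstract can be illustrated with a small sketch. The fusion rule, table sizes, and additive combination below are assumptions for illustration (the paper does not specify them here); the sketch only shows how a flow projection, time-of-day and day-of-week lookup tables, and a per-sensor embedding broadcast into one representation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy sizes; PeMS data is sampled every 5 minutes, hence 288 steps per day.
N, T, d = 8, 12, 16
steps_per_day, days_per_week = 288, 7

W_flow = rng.standard_normal((1, d)) * 0.1                 # projects raw flow
E_tod = rng.standard_normal((steps_per_day, d)) * 0.1      # time-of-day table
E_dow = rng.standard_normal((days_per_week, d)) * 0.1      # day-of-week table
E_node = rng.standard_normal((N, d)) * 0.1                 # per-sensor embedding

flow = rng.standard_normal((N, T, 1))                      # historical flow values
tod = np.arange(T) % steps_per_day                         # time-of-day index
dow = (np.arange(T) // steps_per_day) % days_per_week      # day-of-week index

# Additive fusion (an assumed rule): temporal context broadcasts over sensors,
# node identity broadcasts over time, yielding one (N, T, d) embedding.
z = (flow @ W_flow
     + E_tod[tod][None, :, :]
     + E_dow[dow][None, :, :]
     + E_node[:, None, :])
print(z.shape)
```

In a trained model all four weight arrays would be learnable parameters; the broadcasting pattern is what lets one embedding carry flow magnitude, calendar context, and sensor identity simultaneously.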