CNN-based Multi-In-Multi-Out Model for Efficient Spatiotemporal Prediction

arXiv cs.CV / 5/5/2026

Key Points

  • The paper introduces MIMO-ESP, a CNN-based multi-input multi-output model designed to improve spatiotemporal prediction efficiency and accuracy.
  • It targets limitations of prior CNN and Transformer approaches by enhancing global information modeling while reducing the computational burden typical of self-attention.
  • MIMO-ESP keeps the time axis separate from image channel processing and uses dilation to jointly and effectively capture spatiotemporal dependencies.
  • Experiments on video, traffic, and precipitation benchmark datasets show that MIMO-ESP achieves competitive efficiency while outperforming existing models.
  • Ablation studies further indicate that the proposed components meaningfully contribute to the model’s performance gains.
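
The role of dilation mentioned above can be illustrated with a small, generic sketch. It is not taken from the paper: the function names, kernel size, and dilation schedule are all hypothetical, and the example is one-dimensional for brevity. It shows only the standard property that stacking dilated convolutions widens the receptive field without adding kernel weights.

```python
def dilated_conv1d(signal, kernel, dilation):
    """Valid (no padding), stride-1 dilated 1D convolution over a list of numbers.

    Kernel taps are spaced `dilation` steps apart, so a kernel of length k
    spans (k - 1) * dilation + 1 input positions.
    """
    span = (len(kernel) - 1) * dilation
    return [
        sum(w * signal[i + j * dilation] for j, w in enumerate(kernel))
        for i in range(len(signal) - span)
    ]


def receptive_field(kernel_size, dilations):
    """Receptive field of a stack of stride-1 dilated convolutions.

    Each layer extends the field by (kernel_size - 1) * dilation positions.
    """
    field = 1
    for d in dilations:
        field += (kernel_size - 1) * d
    return field


# Hypothetical settings, chosen only for illustration.
plain = receptive_field(3, [1, 1, 1, 1])    # four ordinary 3-tap layers
dilated = receptive_field(3, [1, 2, 4, 8])  # same depth, growing dilation
print(plain, dilated)

# A 2-tap kernel with dilation 2 sums each sample with the one two steps later.
print(dilated_conv1d([1, 2, 3, 4, 5], [1, 1], dilation=2))
```

With the same number of layers and weights, the dilated stack covers 31 input positions instead of 9, which is the general reason dilation is a common way to bring longer-range context into a convolutional model.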

Abstract

Recently, models based on Convolutional Neural Network (CNN) or Transformer architectures have been proposed to overcome the limitations of Recurrent Neural Network (RNN) based models in spatiotemporal prediction. These models avoid both the limited parallelism caused by sequential processing and the error accumulation caused by recursive prediction, and they achieve high performance. Nevertheless, some challenges remain. First, CNN-based models have difficulty capturing global information because convolution kernels are local, which limits their performance. In addition, information is mixed because the time axis is merged with the image channel axis for processing. Second, Transformer-based models have high complexity due to the self-attention calculation and take a long time to train. In this paper, we propose a new model, the CNN-based Multi-In-Multi-Out model for Efficient Spatiotemporal Prediction (MIMO-ESP), to overcome these limitations. MIMO-ESP captures global information and significantly reduces complexity by building a Transformer-style architecture from CNN components. In addition, it treats the time axis as an independent axis rather than merging it with the channel axis, and it captures spatial and temporal information jointly by applying dilation. This structure makes MIMO-ESP both efficient and high-performing. Extensive experiments on three benchmark datasets covering video, traffic, and precipitation prediction tasks demonstrate the usefulness of MIMO-ESP, which achieves competitive efficiency while outperforming existing models. Furthermore, ablation results demonstrate the usefulness of the individual components of MIMO-ESP, underscoring the potential of the proposed approach.
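
To make the complexity argument concrete, the sketch below compares rough multiply-add counts for mixing information across N = T x H x W positions with self-attention and with a depthwise convolution. It is a back-of-the-envelope estimate, not the paper's own accounting: the function names, tensor sizes, and kernel shape are hypothetical, and linear projections, softmax, and normalization are ignored.

```python
def attention_mixing_cost(num_tokens, dim):
    """Approximate multiply-adds for self-attention token mixing.

    Counts the N x N score matrix (Q K^T) and the weighted sum over V,
    each costing about N * N * dim operations. Projections are ignored.
    """
    return 2 * num_tokens * num_tokens * dim


def depthwise_conv_mixing_cost(num_tokens, dim, kernel_volume):
    """Approximate multiply-adds for a depthwise convolution.

    Each output position in each channel reads `kernel_volume` inputs,
    so the cost grows linearly with the number of positions.
    """
    return num_tokens * kernel_volume * dim


# Hypothetical sizes: 10 frames of 16 x 16 feature maps with 64 channels.
frames, height, width, dim = 10, 16, 16, 64
num_tokens = frames * height * width

# Hypothetical 3 x 7 x 7 (time x height x width) depthwise kernel.
kernel_volume = 3 * 7 * 7

attn = attention_mixing_cost(num_tokens, dim)
conv = depthwise_conv_mixing_cost(num_tokens, dim, kernel_volume)
print(f"attention: {attn:,}  convolution: {conv:,}  ratio: {attn / conv:.1f}x")
```

Because the attention estimate scales with N squared and the convolution estimate with N, the gap widens as resolution or sequence length grows. That matches the abstract's motivation for replacing self-attention with convolutional components.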