Attention-based Multi-modal Deep Learning Model of Spatio-temporal Crop Yield Prediction with Satellite, Soil and Climate Data

arXiv cs.CV / 4/22/2026


Key Points

  • The paper introduces an Attention-Based Multi-Modal Deep Learning Framework (ABMMDLF) for spatio-temporal crop yield prediction aimed at improving accuracy for food security and policy decisions.
  • It fuses multiple data streams—multi-year satellite imagery, high-resolution meteorological time-series, and initial soil properties—instead of relying on a single static source.
  • The model uses CNNs to extract spatial features and a temporal attention mechanism to dynamically focus on relevant phenological periods as conditions change over time.
  • Experiments report an R² score of 0.89, substantially outperforming baseline forecasting models, suggesting the attention-based multimodal approach better captures complex environmental relationships.
  • By explicitly modeling time-varying dependencies and coupling them with spatial cues from images and video sequences, the framework addresses limitations of conventional static-data methods.
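The paper itself includes no code, but the fusion of the three data streams described above can be sketched in a few lines. The shapes, variable names, and late-fusion-by-concatenation strategy below are all illustrative assumptions, not the authors' actual implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-field inputs (all dimensions invented for illustration):
sat_feats = rng.normal(size=(12, 64))  # 12 monthly CNN embeddings of satellite imagery
weather   = rng.normal(size=(12, 8))   # 12 months x 8 meteorological variables
soil      = rng.normal(size=(10,))     # one static soil-property vector per field

# Simple late fusion: align the two temporal streams month-by-month, then
# broadcast the static soil vector to every timestep so a downstream
# temporal model (e.g. the attention module) sees all three modalities.
fused = np.concatenate(
    [sat_feats, weather, np.tile(soil, (12, 1))],
    axis=1,
)
print(fused.shape)  # (12, 82): 12 timesteps, 64 + 8 + 10 fused features
```

Broadcasting the static soil vector across timesteps is one common way to mix static and time-varying inputs; the paper may use a different fusion scheme.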

Abstract

Crop yield prediction is one of the most important challenges for world food security and policy-making decisions. Conventional forecasting techniques are limited in accuracy because they rely on static data sources that do not capture the dynamic and intricate relationships among environmental variables over time [5,13]. This paper presents the Attention-Based Multi-Modal Deep Learning Framework (ABMMDLF) for high-accuracy spatio-temporal crop yield prediction. Rather than relying on a single data source, as traditional models do [12, 21], the model combines multi-year satellite imagery, high-resolution meteorological time series, and initial soil properties. The core architecture uses Convolutional Neural Networks (CNNs) to extract spatial features and a Temporal Attention Mechanism to adaptively weight important phenological periods as they change over time, conditioned on spatial features from image and video sequences. Experiments show that the proposed framework achieves an R² score of 0.89, substantially outperforming the baseline models.
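The temporal-attention pooling step the abstract describes can be sketched without any deep-learning framework: score each timestep's CNN feature vector, softmax the scores into weights, and pool. Everything here (shapes, parameter names, the additive scoring function) is a hypothetical illustration of the general mechanism, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

T, D, H = 12, 64, 32  # timesteps, CNN feature dim, attention hidden dim

# Stand-in for per-timestep spatial features a CNN backbone would produce.
cnn_features = rng.normal(size=(T, D))

# Attention parameters (randomly initialised; learned in a real model).
W = rng.normal(scale=0.1, size=(H, D))
v = rng.normal(scale=0.1, size=(H,))

def temporal_attention(feats):
    """Score each timestep, softmax into weights, and pool over time."""
    scores = np.tanh(feats @ W.T) @ v        # (T,) unnormalised relevance scores
    weights = np.exp(scores - scores.max())  # subtract max for numerical stability
    weights /= weights.sum()                 # softmax over the T timesteps
    context = weights @ feats                # (D,) attention-weighted summary
    return context, weights

context, weights = temporal_attention(cnn_features)
print(context.shape, weights.shape)  # (64,) (12,)
```

The softmax weights give the interpretability the paper highlights: inspecting them shows which phenological periods the model deemed most relevant to the final yield estimate.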