Solar-VLM: Multimodal Vision-Language Models for Augmented Solar Power Forecasting

arXiv cs.AI / 4/7/2026


Key Points

  • The paper introduces Solar-VLM, a multimodal vision-language-model framework aimed at improving photovoltaic (PV) power forecasting, which is highly sensitive to weather conditions and cloud motion.
  • It unifies three input types—multivariate time-series at PV sites, satellite imagery for cloud cover, and textual weather histories—using modality-specific encoders (a patch-based time-series encoder, a Qwen-based vision encoder, and a text encoder).
  • To capture spatial dependencies across geographically distributed PV stations, Solar-VLM adds a cross-site fusion design that uses graph attention over a K-nearest-neighbor station graph plus cross-site attention for adaptive information exchange.
  • Experiments on eight PV stations in northern China show the framework’s effectiveness, and the authors provide a public GitHub implementation.
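The patch-based time-series encoder mentioned above can be illustrated with a minimal NumPy sketch: split each site's multivariate observations into overlapping windows ("patches") and linearly project each one into a model-dimension token. This is a hypothetical illustration of the general patch-embedding idea, not the paper's implementation — the actual patch length, stride, and embedding details are not specified in this summary.

```python
import numpy as np

def patchify(series: np.ndarray, patch_len: int, stride: int) -> np.ndarray:
    """Split a (T, C) multivariate series into overlapping patches.

    Returns an array of shape (num_patches, patch_len * C), one flattened
    patch per row. Hyperparameters here are illustrative, not from the paper.
    """
    T, C = series.shape
    starts = range(0, T - patch_len + 1, stride)
    return np.stack([series[s:s + patch_len].reshape(-1) for s in starts])

def embed_patches(patches: np.ndarray, d_model: int, seed: int = 0) -> np.ndarray:
    """Linearly project each flattened patch into a d_model-dim token.

    A trained model would learn W; a random projection stands in for it here.
    """
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((patches.shape[1], d_model)) / np.sqrt(patches.shape[1])
    return patches @ W

# Example: 96 hourly steps of 3 variables (e.g. power, irradiance, temperature).
x = np.random.default_rng(1).standard_normal((96, 3))
tokens = embed_patches(patchify(x, patch_len=16, stride=8), d_model=64)
print(tokens.shape)  # (11, 64): 11 patch tokens, each 64-dimensional
```

Each row of `tokens` then plays the role of one temporal token fed to the downstream fusion stages.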

Abstract

Photovoltaic (PV) power forecasting plays a critical role in power system dispatch and market participation. Because PV generation is highly sensitive to weather conditions and cloud motion, accurate forecasting requires effective modeling of complex spatiotemporal dependencies across multiple information sources. Although recent studies have advanced AI-based forecasting methods, most fail to fuse temporal observations, satellite imagery, and textual weather information in a unified framework. This paper proposes Solar-VLM, a large-language-model-driven framework for multimodal PV power forecasting. First, modality-specific encoders are developed to extract complementary features from heterogeneous inputs. The time-series encoder adopts a patch-based design to capture temporal patterns from multivariate observations at each site. The visual encoder, built upon a Qwen-based vision backbone, extracts cloud-cover information from satellite images. The text encoder distills historical weather characteristics from textual descriptions. Second, to capture spatial dependencies across geographically distributed PV stations, a cross-site feature fusion mechanism is introduced. Specifically, a Graph Learner models inter-station correlations through a graph attention network constructed over a K-nearest-neighbor (KNN) graph, while a cross-site attention module further facilitates adaptive information exchange among sites. Finally, experiments conducted on data from eight PV stations in a northern province of China demonstrate the effectiveness of the proposed framework. Our proposed model is publicly available at https://github.com/rhp413/Solar-VLM.
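The cross-site fusion step described in the abstract — a K-nearest-neighbor station graph with graph-attention aggregation — can be sketched in NumPy as follows. This is a simplified, single-head GAT-style update under assumed inputs (station coordinates and per-station feature vectors); the paper's actual Graph Learner, attention parameterization, and cross-site attention module are not reproduced here.

```python
import numpy as np

def knn_graph(coords: np.ndarray, k: int) -> np.ndarray:
    """Binary adjacency: each station connects to its k nearest neighbors."""
    d = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)                  # no self-loops in the KNN step
    nbrs = np.argsort(d, axis=1)[:, :k]
    A = np.zeros((len(coords), len(coords)))
    for i, js in enumerate(nbrs):
        A[i, js] = 1.0
    return A

def graph_attention(H: np.ndarray, A: np.ndarray,
                    a_src: np.ndarray, a_dst: np.ndarray) -> np.ndarray:
    """One simplified GAT-style aggregation over the station graph.

    Attention logits e_ij = LeakyReLU(a_src . h_i + a_dst . h_j), masked to
    KNN edges, softmax-normalized over each station's neighbors, then used
    to mix neighbor features.
    """
    logits = (H @ a_src)[:, None] + (H @ a_dst)[None, :]
    logits = np.where(logits > 0, logits, 0.2 * logits)   # LeakyReLU
    logits = np.where(A > 0, logits, -np.inf)             # keep KNN edges only
    alpha = np.exp(logits - logits.max(axis=1, keepdims=True))
    alpha = alpha / alpha.sum(axis=1, keepdims=True)
    return alpha @ H

# Example: 8 PV stations (as in the paper's dataset), 16-dim features each.
coords = np.random.default_rng(0).uniform(size=(8, 2))
H = np.random.default_rng(1).standard_normal((8, 16))
A = knn_graph(coords, k=3)
rng = np.random.default_rng(2)
H_fused = graph_attention(H, A, rng.standard_normal(16), rng.standard_normal(16))
print(H_fused.shape)  # (8, 16): one spatially fused feature vector per station
```

A learned model would train `a_src`/`a_dst` (and typically a feature projection) end-to-end; random vectors stand in for them in this sketch.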