SynopticBench: Evaluating Vision-Language Models on Generating Weather Forecast Discussions of the Future

arXiv cs.CL / 4/21/2026


Key Points

  • The paper introduces SynopticBench, a large dataset of 1,367,041 National Weather Service Area Forecast Discussion texts paired with meteorological images (500mb geopotential height, 2m temperature, and 850mb wind).
  • It argues that weather forecasting text generation is especially difficult because the atmosphere is chaotic and varies across multiple spatial and temporal scales.
  • The authors propose SPACE (Synoptic Phenomena Alignment and Coverage Evaluation), a new evaluation framework aimed at measuring the quality of text descriptions of synoptic weather phenomena.
  • Experiments with state-of-the-art vision-language models show that current evaluation metrics are sensitive in this domain and that better evaluation is needed for reliable progress in weather/climate text generation.
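To make the text-image pairing concrete, here is a minimal sketch of how one sample might be represented in code. The class name, field names, file names, and sample text are all hypothetical placeholders; the summary does not describe the dataset's actual schema or file layout.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ForecastSample:
    """One hypothetical text-image pair; field names are illustrative only."""

    discussion_text: str       # Area Forecast Discussion text
    geopotential_500mb: Path   # image of 500mb geopotential height
    temperature_2m: Path       # image of 2m temperature
    wind_850mb: Path           # image of 850mb wind

    def image_paths(self) -> list[Path]:
        """Return the three paired images in a fixed order."""
        return [self.geopotential_500mb, self.temperature_2m, self.wind_850mb]


# Placeholder values, not taken from the dataset.
sample = ForecastSample(
    discussion_text="Upper ridge builds over the region through midweek.",
    geopotential_500mb=Path("z500.png"),
    temperature_2m=Path("t2m.png"),
    wind_850mb=Path("wind850.png"),
)
```

Keeping the three fields separate, rather than a single list of images, makes it explicit which meteorological variable each image shows.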

Abstract

Recent advances in vision-language models (VLMs) have led to significant improvements in a plethora of complex multimodal tasks like image captioning, report generation, and visual perception. However, generating text from meteorological data is highly challenging because the atmosphere is a chaotic system that is rapidly changing at various spatial and temporal scales. Given the complexity of atmospheric phenomena, it is critical to verifiably quantify the effectiveness of existing VLMs on weather forecasting data. In this work, we present SynopticBench, a high-quality dataset consisting of 1,367,041 text samples of Area Forecast Discussions created by the National Weather Service over the continental United States paired with images of 500mb geopotential height, 2-meter temperature, and 850mb wind velocity in weather forecasts. We also present Synoptic Phenomena Alignment and Coverage Evaluation (SPACE), a novel evaluation framework that can be used to effectively estimate the quality of text descriptions of synoptic weather phenomena. Extensive experiments on generating forecast discussions using state-of-the-art VLMs show the sensitivity of existing evaluation metrics in this domain and enable further exploration into synoptic weather and climate text generation.
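The abstract names SPACE but does not define it here. As a rough illustration of what a "coverage" measure over synoptic phenomena could look like, the toy sketch below scores the fraction of phenomena mentioned in a reference discussion that also appear in a generated one. It is not the paper's method: the function name, the small lexicon, the substring matching, and both example sentences are assumptions made for illustration.

```python
def phenomenon_coverage(reference: str, candidate: str, lexicon: set[str]) -> float:
    """Fraction of lexicon terms found in the reference that the candidate also mentions.

    Toy proxy only; this is NOT the SPACE framework from the paper.
    """
    ref_lower = reference.lower()
    cand_lower = candidate.lower()
    # Phenomena the reference discussion actually talks about.
    ref_terms = {term for term in lexicon if term in ref_lower}
    if not ref_terms:
        return 1.0  # nothing to cover
    covered = {term for term in ref_terms if term in cand_lower}
    return len(covered) / len(ref_terms)


# Hypothetical lexicon and texts, invented for this example.
LEXICON = {"ridge", "trough", "cold front", "warm front", "low pressure", "high pressure"}
reference = "An upper trough digs into the Plains while a cold front moves east."
candidate = "A cold front pushes east tonight under a building ridge."

score = phenomenon_coverage(reference, candidate, LEXICON)
print(score)  # 0.5: the cold front is covered, the trough is missed
```

Plain substring matching misses synonyms and paraphrases, and it says nothing about whether a phenomenon is placed or timed correctly, which is part of why a dedicated evaluation framework is of interest in this domain.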