Spectral-Aware Text-to-Time Series Generation with Billion-Scale Multimodal Meteorological Data

arXiv cs.LG · March 31, 2026


Key Points

  • The paper proposes a unified framework for text-guided meteorological time-series generation that accounts for the spectral-temporal structure of weather signals.
  • It introduces MeteoCap-3B, a billion-scale multimodal meteorological dataset with expert-level captions produced via a multi-agent collaborative captioning pipeline to improve physical consistency.
  • The proposed MTransformer is a diffusion-based model that uses a Spectral Prompt Generator and frequency-aware attention to map text into multi-band spectral priors for more precise semantic control.
  • Experiments report state-of-the-art generation quality, strong cross-modal alignment, and improved semantic controllability, with downstream forecasting gains especially in data-sparse and zero-shot scenarios.
  • The approach also shows generalization on broader time-series benchmarks, suggesting the method may apply beyond meteorology.
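
As a concrete illustration of the multi-band spectral structure the key points refer to, the sketch below splits a weather-like series into additive frequency bands via FFT masking. The band cutoffs and the toy diurnal/seasonal signal are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def split_into_bands(x, band_edges):
    """Split a 1-D signal into additive frequency bands via FFT masking.

    band_edges: cutoff indices into the rFFT spectrum, e.g. [4, 32]
    yields three bands covering bins [0,4), [4,32), [32, end).
    (Illustrative only; the paper's band definitions are not given here.)
    """
    spec = np.fft.rfft(x)
    edges = [0] + list(band_edges) + [len(spec)]
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        masked = np.zeros_like(spec)   # keep only this band's bins
        masked[lo:hi] = spec[lo:hi]
        bands.append(np.fft.irfft(masked, n=len(x)))
    return bands

# Toy "temperature" series: slow seasonal trend + diurnal cycle + noise
t = np.arange(256)
x = (10 * np.sin(2 * np.pi * t / 256)
     + 2 * np.sin(2 * np.pi * t / 24)
     + 0.1 * np.random.default_rng(0).standard_normal(256))
bands = split_into_bands(x, [4, 32])
# The bands form an exact additive decomposition of the input
assert np.allclose(sum(bands), x)
```

Because the masks partition the spectrum, summing the bands reconstructs the input exactly; a per-band prior can then modulate each component independently.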

Abstract

Text-to-time-series generation is particularly important in meteorology, where natural language offers intuitive control over complex, multi-scale atmospheric dynamics. Existing approaches are constrained by the lack of large-scale, physically grounded multimodal datasets and by architectures that overlook the spectral-temporal structure of weather signals. We address these challenges with a unified framework for text-guided meteorological time-series generation. First, we introduce MeteoCap-3B, a billion-scale weather dataset paired with expert-level captions constructed via a Multi-agent Collaborative Captioning (MACC) pipeline, yielding information-dense and physically consistent annotations. Building on this dataset, we propose MTransformer, a diffusion-based model that enables precise semantic control by mapping textual descriptions into multi-band spectral priors through a Spectral Prompt Generator, which guides generation via frequency-aware attention. Extensive experiments on real-world benchmarks demonstrate state-of-the-art generation quality, accurate cross-modal alignment, strong semantic controllability, and substantial gains in downstream forecasting under data-sparse and zero-shot settings. Additional results on general time-series benchmarks indicate that the proposed framework generalizes beyond meteorology.
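The abstract's core architectural idea, mapping a caption into per-band priors that steer generation through frequency-aware attention, can be sketched at the shape level. All weights below are random stand-ins for learned parameters; this is a hypothetical illustration of the mechanism, not the trained MTransformer:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def frequency_aware_attention(x, text_tokens, band_edges, rng):
    """One hypothetical frequency-aware cross-attention step.

    x:           (T,) noisy series at one denoising step
    text_tokens: (L, d) caption token embeddings
    Each frequency band attends over the caption with its own query,
    and the resulting text-conditioned prior gates that band's spectrum.
    """
    d = text_tokens.shape[1]
    spec = np.fft.rfft(x)
    edges = [0] + list(band_edges) + [len(spec)]
    out = np.zeros_like(spec)
    for lo, hi in zip(edges[:-1], edges[1:]):
        q = rng.standard_normal(d)                    # band-specific query (random stand-in)
        attn = softmax(text_tokens @ q / np.sqrt(d))  # (L,) attention over caption tokens
        prior = attn @ text_tokens                    # (d,) text-conditioned band prior
        gate = 1.0 + np.tanh(prior.mean())            # scalar gain for this band (toy choice)
        out[lo:hi] = gate * spec[lo:hi]
    return np.fft.irfft(out, n=len(x))
```

In the actual model the gating would be learned jointly with the diffusion denoiser; the sketch only shows how text can address different frequency bands separately, which is what gives captions like "sharp diurnal swings over a mild warming trend" independent handles on fast and slow dynamics.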