Pretrain Where? Investigating How Pretraining Data Diversity Impacts Geospatial Foundation Model Performance

arXiv cs.LG / 4/24/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper studies how the geographic composition and diversity of pretraining data affect the downstream performance of geospatial foundation models, an area that has been largely overlooked.
  • Researchers built multiple pretraining datasets (global and per-continent) and evaluated them on both global and local downstream benchmarks, finding that pretraining on European data outperformed the global and other continent-specific setups.
  • To explain these differences, they analyzed 10 pretraining datasets across diversity dimensions including continents, biomes, land cover, and spectral values.
  • They report that spectral diversity correlates strongly with downstream performance, while the other diversity factors show only weak correlations, pointing to a key new dimension to account for when designing high-performing pretraining data.
  • The authors open-sourced seven new pretraining datasets, pretrained models, and their experimental framework to support further research.
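The core analysis above, correlating each pretraining dataset's diversity along some dimension with its downstream score, can be sketched in a few lines. This is a hedged illustration, not the authors' code: the paper does not specify its exact diversity metric here, so `spectral_diversity` below uses one plausible proxy (mean per-band standard deviation of pixel values), and all datasets and scores are synthetic placeholders.

```python
# Illustrative sketch: rank-correlate a spectral-diversity proxy with
# downstream performance across pretraining datasets. All numbers are
# synthetic placeholders, not results from the paper.
import numpy as np
from scipy.stats import spearmanr

def spectral_diversity(pixels: np.ndarray) -> float:
    """One plausible proxy: mean per-band standard deviation of pixel
    values (higher = wider spectral spread). Shape: (n_pixels, n_bands)."""
    return float(pixels.std(axis=0).mean())

rng = np.random.default_rng(0)
# Synthetic stand-ins for 10 pretraining datasets with increasing spread
datasets = [rng.normal(0.0, 1.0 + 0.2 * i, size=(1000, 4)) for i in range(10)]
diversities = [spectral_diversity(d) for d in datasets]

# Hypothetical downstream scores that track diversity plus small noise
scores = [dv + rng.normal(0.0, 0.05) for dv in diversities]

# Spearman rank correlation, robust to nonlinear but monotone trends
rho, p = spearmanr(diversities, scores)
print(f"Spearman rho = {rho:.2f} (p = {p:.3g})")
```

With real data, one would substitute actual per-dataset diversity measurements (continents, biomes, land cover, or spectral statistics) and the measured downstream metrics for each pretrained model.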

Abstract

New geospatial foundation models often introduce both a new model architecture and a new pretraining dataset, sampled using different notions of data diversity. Performance differences are largely attributed to the model architecture or input modalities, while the role of the pretraining dataset is rarely studied. To address this research gap, we conducted a systematic study of how the geographic composition of pretraining data affects a model's downstream performance. We created global and per-continent pretraining datasets and evaluated them on global and per-continent downstream datasets. We found that the pretraining dataset from Europe outperformed global and continent-specific pretraining datasets on both global and local downstream evaluations. To investigate the factors influencing a pretraining dataset's downstream performance, we analysed 10 pretraining datasets using diversity across continents, biomes, land cover, and spectral values. We found that only spectral diversity was strongly correlated with performance, while the others were weakly correlated. This finding establishes a new dimension of diversity to be accounted for when creating a high-performing pretraining dataset. We open-sourced 7 new pretraining datasets, pretrained models, and our experimental framework at https://github.com/kerner-lab/pretrain-where.