Pretrain Where? Investigating How Pretraining Data Diversity Impacts Geospatial Foundation Model Performance
arXiv cs.LG / 4/24/2026
Key Points
- The paper studies how the geographic composition and diversity of pretraining data affect the downstream performance of geospatial foundation models, an area that has been largely overlooked.
- Researchers built multiple pretraining datasets (one global and one per continent) and evaluated them on both global and local downstream benchmarks, finding that models pretrained on European data outperformed the other pretraining setups.
- To explain these differences, they analyzed 10 pretraining datasets across diversity dimensions including continents, biomes, land cover, and spectral values.
- They report that spectral diversity shows a strong correlation with downstream performance, while other diversity factors have weak correlations, suggesting a key new dimension to include when designing high-performing pretraining data.
- The authors open-sourced seven new pretraining datasets, pretrained models, and their experimental framework to support further research.
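The headline finding is a correlation between a diversity metric computed over pretraining data and downstream benchmark scores. As a minimal sketch of how such an analysis could be run, the snippet below computes a hypothetical spectral-diversity measure (mean per-band standard deviation over sampled pixel spectra) and a Spearman rank correlation against downstream scores. All function names and the diversity metric itself are illustrative assumptions, not the paper's actual method.

```python
import statistics

def spectral_diversity(pixels):
    # Hypothetical diversity metric: average per-band standard deviation
    # across sampled pixel spectra. Each row is one pixel, columns are bands.
    bands = list(zip(*pixels))
    return statistics.mean(statistics.pstdev(b) for b in bands)

def _rank(values):
    # Assign 0-based ranks by sorted order (ignores ties for brevity).
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for r, i in enumerate(order):
        ranks[i] = float(r)
    return ranks

def spearman(x, y):
    # Spearman correlation = Pearson correlation of the rank vectors.
    rx, ry = _rank(x), _rank(y)
    mx, my = statistics.mean(rx), statistics.mean(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx)
           * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

# Toy example: one diversity value per pretraining dataset, paired with
# that dataset's downstream benchmark score (numbers are made up).
diversities = [0.12, 0.35, 0.08, 0.41, 0.27]
scores      = [61.0, 72.5, 58.3, 74.1, 69.8]
rho = spearman(diversities, scores)
```

In this toy setup a correlation near 1.0 would mirror the paper's reported pattern for spectral diversity, while a metric like continent count would presumably yield a much weaker `rho`.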
Related Articles

Your MCP server probably has too many tools
Dev.to

MCP Auth That Actually Works: OAuth for Remote Servers
Dev.to

GoDavaii's Day 5: When 22 Indian Languages Redefine 'Hard' in Health AI
Dev.to

Gemma 4 and Qwen 3.6 with q8_0 and q4_0 KV cache: KL divergence results
Reddit r/LocalLLaMA
Korea arrests man over fake AI image of the wolf Neukgu: up to 5 years
Dev.to