BigEarthNet.txt: A Large-Scale Multi-Sensor Image-Text Dataset and Benchmark for Earth Observation

arXiv cs.CV / 4/1/2026


Key Points

  • BigEarthNet.txt is introduced as a large-scale, multi-sensor remote sensing (RS) image-text dataset for Earth observation, built from co-registered Sentinel-1 SAR and Sentinel-2 multispectral imagery.
  • The dataset includes 464,044 images paired with 9.6M text annotations featuring geographically anchored captions, visual question answering, and referring-expression instructions for bounding-box prediction.
  • The authors report that BigEarthNet.txt offers greater textual richness and more diverse annotation types than prior RS image-text datasets.
  • A manually verified benchmark split is provided to evaluate vision-language models on RS and CV tasks, highlighting current model limitations on complex land-use/land-cover (LULC) classes.
  • Fine-tuning with BigEarthNet.txt is reported to yield consistent performance improvements across the evaluated tasks.

Abstract

Vision-language models (VLMs) have shown strong performance in computer vision (CV), yet their performance on remote sensing (RS) data remains limited due to the lack of large-scale, multi-sensor RS image-text datasets with diverse textual annotations. Existing datasets predominantly include aerial Red-Green-Blue imagery, with short or weakly grounded captions, and provide limited diversity in annotation types. To address this limitation, we introduce BigEarthNet.txt, a large-scale, multi-sensor image-text dataset designed to advance instruction-driven image-text learning in Earth observation across multiple tasks. BigEarthNet.txt contains 464,044 co-registered Sentinel-1 synthetic aperture radar and Sentinel-2 multispectral images with 9.6M text annotations, including: i) geographically anchored captions describing land-use/land-cover (LULC) classes, their spatial relations, and environmental context; ii) visual question answering pairs relevant for different tasks; and iii) referring expression detection instructions for bounding box prediction. Through a comparative statistical analysis, we demonstrate that BigEarthNet.txt surpasses existing RS image-text datasets in textual richness and annotation type variety. We further establish a manually verified benchmark split to evaluate VLMs in RS and CV. The results show the limitations of these models on tasks that involve complex LULC classes, whereas fine-tuning using BigEarthNet.txt results in consistent performance gains across all considered tasks.
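To make the three annotation types concrete, below is a minimal sketch of what a single image-text record combining a caption, a VQA pair, and a referring-expression instruction might look like. All field names, values, and the overall schema are illustrative assumptions for this note; the actual BigEarthNet.txt format is defined by the dataset release, not shown here.

```python
import json

# Hypothetical sample record covering the three annotation types the
# abstract describes. Every field name and value here is an assumption,
# not the dataset's real schema.
sample_json = """
{
  "patch_id": "example_patch_0001",
  "modalities": {"s1": "s1_patch.tif", "s2": "s2_patch.tif"},
  "caption": "Arable land in the north, bordered by a coniferous forest to the south.",
  "vqa": [
    {"question": "Which LULC classes are present?",
     "answer": "arable land, coniferous forest"}
  ],
  "referring_expressions": [
    {"expression": "the coniferous forest in the south",
     "bbox": [12, 68, 110, 120]}
  ]
}
"""

record = json.loads(sample_json)

# The three annotation types described in the abstract:
print(record["caption"])                            # geographically anchored caption
print(record["vqa"][0]["question"])                 # visual question answering pair
print(record["referring_expressions"][0]["bbox"])   # referring-expression bounding box
```

A record shaped like this would let one instruction-tuning pipeline draw captioning, VQA, and detection supervision from the same co-registered Sentinel-1/Sentinel-2 patch.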