Towards Realistic Open-Vocabulary Remote Sensing Segmentation: Benchmark and Baseline

arXiv cs.CV / 4/20/2026


Key Points

  • The paper proposes OVRSISBenchV2, a large-scale, application-oriented benchmark to better evaluate open-vocabulary remote sensing image segmentation under realistic open-world geospatial demands.
  • It introduces OVRSIS95K (about 95K image–mask pairs across 35 semantic categories) and expands evaluation with 10 downstream datasets, yielding 170K images and 128 categories to increase diversity and difficulty.
  • OVRSISBenchV2 goes beyond general open-vocabulary segmentation by adding downstream protocols for building extraction, road extraction, and flood detection.
  • The authors propose Pi-Seg, a baseline that improves transferability using a “positive-incentive noise” mechanism with learnable, semantically guided perturbations to broaden the visual-text feature space during training.
  • Experiments across OVRSISBenchV1, OVRSISBenchV2, and downstream tasks show Pi-Seg performs strongly and consistently, especially on the harder OVRSISBenchV2 benchmark, and the code/datasets are publicly available.

Abstract

Open-vocabulary remote sensing image segmentation (OVRSIS) remains underexplored due to fragmented datasets, limited training diversity, and the lack of evaluation benchmarks that reflect realistic geospatial application demands. Our previous *OVRSISBenchV1* established an initial cross-dataset evaluation protocol, but its limited scope is insufficient for assessing realistic open-world generalization. To address this issue, we propose *OVRSISBenchV2*, a large-scale and application-oriented benchmark for OVRSIS. We first construct **OVRSIS95K**, a balanced dataset of about 95K image–mask pairs covering 35 common semantic categories across diverse remote sensing scenes. Built upon OVRSIS95K and 10 downstream datasets, OVRSISBenchV2 contains 170K images and 128 categories, substantially expanding scene diversity, semantic coverage, and evaluation difficulty. Beyond standard open-vocabulary segmentation, it further includes downstream protocols for building extraction, road extraction, and flood detection, thereby better reflecting realistic geospatial application demands and complex deployment scenarios. We also propose **Pi-Seg**, a baseline for OVRSIS. Pi-Seg improves transferability through a **positive-incentive noise** mechanism, where learnable and semantically guided perturbations broaden the visual-text feature space during training. Extensive experiments on OVRSISBenchV1, OVRSISBenchV2, and downstream tasks show that Pi-Seg delivers strong and consistent results, particularly on the more challenging OVRSISBenchV2 benchmark. Our results highlight both the importance of realistic benchmark design and the effectiveness of perturbation-based transfer for OVRSIS. The code and datasets are available at [LiBingyu01/RSKT-Seg/tree/Pi-Seg](https://github.com/LiBingyu01/RSKT-Seg/tree/Pi-Seg).
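To make the "positive-incentive noise" idea concrete, the sketch below illustrates one plausible reading of it: during training, Gaussian noise is injected into the visual features, with its per-token magnitude guided by how strongly each token matches the class text embeddings, and the per-class scales treated as learnable parameters. This is a minimal NumPy sketch under our own assumptions; the function name, the softmax-weighted similarity guidance, and the per-class `noise_scale` parameter are illustrative, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def positive_incentive_noise(visual_feats, text_embeds, noise_scale, training=True):
    """Hedged sketch of semantically guided feature perturbation.

    visual_feats: (N, D) visual token features
    text_embeds:  (C, D) class text embeddings (e.g. from a CLIP-style encoder)
    noise_scale:  (C,)   per-class noise magnitudes (learnable in a real model;
                         a plain array here for illustration)
    """
    if not training:
        # Perturbation is a training-time regularizer only.
        return visual_feats
    # Cosine similarity between each visual token and each class embedding.
    v = visual_feats / np.linalg.norm(visual_feats, axis=1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    sim = v @ t.T                                                   # (N, C)
    # Softmax over classes: how much each class "guides" each token's noise.
    weights = np.exp(sim) / np.exp(sim).sum(axis=1, keepdims=True)  # (N, C)
    # Per-token noise magnitude, blended from the learnable per-class scales.
    sigma = weights @ noise_scale                                   # (N,)
    noise = rng.standard_normal(visual_feats.shape) * sigma[:, None]
    return visual_feats + noise
```

In this reading, tokens that align with a given class receive noise scaled for that class, so the perturbed features broaden the visual-text feature space around each category's neighborhood rather than adding uniform jitter.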