RSRCC: A Remote Sensing Regional Change Comprehension Benchmark Constructed via Retrieval-Augmented Best-of-N Ranking

arXiv cs.CV / 4/23/2026


Key Points

  • RSRCC is a newly proposed remote sensing benchmark for change question-answering that focuses on explaining what changed in natural language, not just locating changes.
  • The dataset includes 126k questions (87k train, 17.1k validation, 22k test) and emphasizes localized, change-specific semantic reasoning.
  • The authors claim RSRCC is the first remote sensing change QA benchmark explicitly designed for fine-grained reasoning-based supervision.
  • RSRCC is constructed using a hierarchical semi-supervised pipeline that extracts candidate change regions from segmentation masks, filters them with image-text embeddings, and then performs retrieval-augmented vision-language curation with Best-of-N ranking to resolve ambiguity and reduce noise.
  • The dataset is publicly released on Hugging Face (https://huggingface.co/datasets/google/RSRCC) for further research and evaluation.

Abstract

Traditional change detection identifies where changes occur, but does not explain what changed in natural language. Existing remote sensing change captioning datasets typically describe overall image-level differences, leaving fine-grained localized semantic reasoning largely unexplored. To close this gap, we present RSRCC, a new benchmark for remote sensing change question-answering containing 126k questions, split into 87k training, 17.1k validation, and 22k test instances. Unlike prior datasets, RSRCC is built around localized, change-specific questions that require reasoning about a particular semantic change. To the best of our knowledge, this is the first remote sensing change question-answering benchmark designed explicitly for such fine-grained reasoning-based supervision. To construct RSRCC, we introduce a hierarchical semi-supervised curation pipeline that uses Best-of-N ranking as a critical final ambiguity-resolution stage. First, candidate change regions are extracted from semantic segmentation masks, then initially screened using an image-text embedding model, and finally validated through retrieval-augmented vision-language curation with Best-of-N ranking. This process enables scalable filtering of noisy and ambiguous candidates while preserving semantically meaningful changes. The dataset is available at https://huggingface.co/datasets/google/RSRCC.
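The final curation stage described above scores N candidate descriptions for each change region and keeps only unambiguous winners. The paper does not publish its implementation; the sketch below is a minimal, hypothetical illustration of the Best-of-N idea, using toy fixed-length vectors in place of a real image-text embedding model and a simple score margin as the ambiguity test.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def best_of_n(region_embedding, candidate_captions, embed_text, margin=0.05):
    """Rank N candidate captions against a change-region embedding.

    Keep the top caption only if it beats the runner-up by `margin`;
    otherwise return None, discarding the region as ambiguous. The
    margin test is an illustrative stand-in for the paper's
    ambiguity-resolution step, not its actual criterion.
    """
    scored = sorted(
        ((cosine(region_embedding, embed_text(c)), c) for c in candidate_captions),
        reverse=True,
    )
    (best_score, best_caption), (second_score, _) = scored[0], scored[1]
    if best_score - second_score < margin:
        return None  # ambiguous: no caption clearly wins
    return best_caption

# Toy text embeddings standing in for a real image-text model (hypothetical).
VOCAB = {
    "building demolished": [0.9, 0.1, 0.0],
    "road widened":        [0.1, 0.9, 0.0],
    "new vegetation":      [0.0, 0.2, 0.9],
}
embed_text = lambda caption: VOCAB[caption]

# Pretend embedding of a cropped change region that clearly shows a demolition.
region = [0.85, 0.15, 0.05]
print(best_of_n(region, list(VOCAB), embed_text))  # → building demolished
```

In a real pipeline the captions would come from a vision-language model prompted with retrieved context, and `embed_text`/`region` would come from a CLIP-style encoder; the value of the Best-of-N stage is that candidates with no clear winner are dropped rather than admitted as noisy labels.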