RefAerial: A Benchmark and Approach for Referring Detection in Aerial Images

arXiv cs.CV / 4/23/2026


Key Points

  • The paper introduces RefAerial, a large-scale benchmark dataset for referring detection in aerial images, designed to overcome limitations of prior ground-image datasets.
  • RefAerial is characterized by low but diverse object-to-scene ratios, many targets and distractors, complex fine-grained referring descriptions, and broad, diverse aerial scenes.
  • The authors develop REA-Engine, a human-in-the-loop semi-automated annotation system to efficiently generate referring pairs for the dataset.
  • They find that existing ground referring detection models degrade significantly on aerial data due to scale-variation issues, and propose a scale-comprehensive and sensitive (SCS) framework using mixture-of-granularity attention plus a comprehensive-to-sensitive two-stage decoding strategy.
  • The proposed SCS framework delivers strong results on RefAerial and also shows performance gains on traditional ground referring detection datasets.
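The comprehensive-to-sensitive two-stage decoding mentioned above follows a general coarse-to-fine pattern: a comprehensive first pass keeps a small set of candidate targets, and a sensitive second pass re-ranks only those survivors. The sketch below is a hypothetical illustration of that pattern, not the authors' implementation; the function name and score inputs are assumptions.

```python
import numpy as np

def cts_decode(coarse_scores, fine_scores, k=5):
    """Hypothetical coarse-to-fine decoding sketch.

    coarse_scores: per-candidate scores from a comprehensive (coarse) pass.
    fine_scores:   per-candidate scores from a sensitive (fine) pass.
    Returns the index of the selected candidate.
    """
    # Stage 1 (comprehensive): keep the top-k candidates by coarse score.
    topk = np.argsort(coarse_scores)[-k:]
    # Stage 2 (sensitive): re-rank only the survivors with the fine score.
    return topk[np.argmax(fine_scores[topk])]

coarse = np.array([0.1, 0.9, 0.8, 0.2, 0.7, 0.05])
fine = np.array([0.0, 0.0, 0.3, 0.0, 0.9, 0.2])
best = cts_decode(coarse, fine, k=3)
print(best)  # candidate 4: survives the coarse cut, wins the fine re-rank
```

Restricting the expensive fine scoring to the top-k survivors is what makes such a scheme attractive in broad aerial scenes with many distractors.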

Abstract

Referring detection aims to locate the target referred to by a natural-language expression, and has recently attracted growing research interest. However, existing datasets are limited to ground images with large objects centered in relatively small scenes. This paper introduces a large-scale, challenging dataset for referring detection in aerial images, termed RefAerial. It is distinguished from conventional ground referring detection datasets by four characteristics: (1) low but diverse object-to-scene ratios, (2) numerous targets and distractors, (3) complex and fine-grained referring descriptions, and (4) diverse and broad scenes in the aerial view. We also develop a human-in-the-loop referring expansion and annotation engine (REA-Engine) for efficient semi-automated referring pair annotation. Besides, we observe that existing ground referring detection approaches exhibit serious performance degradation on our aerial dataset due to the intrinsic scale-variation issue within and across aerial images. Therefore, we further propose a novel scale-comprehensive and sensitive (SCS) framework for referring detection in aerial images. It consists of a mixture-of-granularity (MoG) attention and a two-stage comprehensive-to-sensitive (CtS) decoding strategy. Specifically, the mixture-of-granularity attention is developed for scale-comprehensive target understanding, while the two-stage comprehensive-to-sensitive decoding strategy is designed for coarse-to-fine referring target decoding. Eventually, the proposed SCS framework achieves remarkable performance on our aerial referring detection dataset and even yields a promising performance boost on conventional ground referring detection datasets.
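A mixture-of-granularity attention, as described in the abstract, can be pictured as cross-attention computed over visual features at several granularities (e.g. fine, mid, and coarse token grids), with the per-granularity outputs mixed by softmax gates. The following is a minimal sketch under those assumptions; all names, shapes, and the gating scheme are illustrative and do not come from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, keys, values):
    """Single-query scaled dot-product attention.
    query: (d,), keys/values: (n, d) -> (d,)"""
    scores = keys @ query / np.sqrt(query.shape[-1])
    return softmax(scores) @ values

def mog_attention(query, granularity_feats, gate_logits):
    """Hypothetical mixture-of-granularity attention sketch.
    granularity_feats: list of (n_g, d) visual feature grids, one per
    granularity; gate_logits: (G,) mixing logits (assumed learned)."""
    outs = np.stack([cross_attention(query, f, f) for f in granularity_feats])
    gates = softmax(gate_logits)   # one mixing weight per granularity
    return gates @ outs            # (d,) scale-comprehensive feature

rng = np.random.default_rng(0)
d = 8
text_query = rng.normal(size=d)
feats = [rng.normal(size=(n, d)) for n in (64, 16, 4)]  # three granularities
fused = mog_attention(text_query, feats, gate_logits=np.zeros(3))
print(fused.shape)  # (8,)
```

With zero gate logits the three granularities are mixed uniformly; in practice the gates would be predicted per query, letting small targets lean on the fine grid and large ones on the coarse grid, which matches the scale-comprehensive motivation.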