DINO Soars: DINOv3 for Open-Vocabulary Semantic Segmentation of Remote Sensing Imagery

arXiv cs.CV / 5/6/2026



Key Points

The paper introduces CAFe-DINO, an open-vocabulary semantic segmentation model for remote sensing imagery designed to avoid costly RS-specific supervised fine-tuning.
It builds on DINOv3’s strong results on the GEO-bench segmentation benchmark, where it surpassed state-of-the-art RS foundation models without any RS pre-training, and uses DINO.txt to enable open-vocabulary segmentation.
CAFe-DINO refines DINOv3’s text-image similarity scores with cost aggregation and training-free upsampling; it is fine-tuned only on an RS-targeted subset of COCO-Stuff rather than on RS imagery.
Experiments show state-of-the-art performance on major RS segmentation datasets, outperforming OVSS methods that are fine-tuned on RS data.
The authors publicly release the code and data at the provided GitHub repository, supporting reproducibility and further research.

Abstract

The remote sensing (RS) domain suffers from a lack of densely labeled datasets, which are costly to obtain. Models that can segment RS imagery well without supervised fine-tuning are therefore valuable, but existing solutions fall behind supervised methods. Recently, DINOv3 surpassed SOTA RS foundation models on the GEO-bench segmentation benchmark without pre-training on RS data. Additionally, DINO.txt has enabled open-vocabulary semantic segmentation (OVSS) with the DINOv3 backbone. We leverage these developments to form an OVSS model for RS imagery that requires no RS-domain fine-tuning. Our model, CAFe-DINO (Cost Aggregation + Feature Upsampling with DINO), exploits the strong OVSS performance of DINOv3 for RS imagery via cost aggregation and training-free upsampling of text-image similarity scores. The robust latent representation of the DINOv3 backbone eliminates the need for fine-tuning on RS imagery; we instead fine-tune our model on an RS-targeted subset of COCO-Stuff. CAFe-DINO achieves state-of-the-art performance on key RS segmentation datasets, outperforming OVSS methods fine-tuned on RS data. Our code and data are publicly available at https://github.com/rfaulk/DINO_Soars.
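
To make the idea of upsampling text-image similarity scores concrete, here is a minimal NumPy sketch of the generic open-vocabulary segmentation recipe: compute a cosine-similarity volume between patch features and class text embeddings, upsample it without any learned parameters, and take the per-pixel argmax. This is not the authors' implementation. The random arrays stand in for DINOv3 patch features and DINO.txt text embeddings, nearest-neighbor repetition stands in for whatever training-free upsampler the paper uses, and the learned cost-aggregation stage is omitted entirely. All function names are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def similarity_volume(patch_feats, text_embeds):
    """Cosine similarity between every patch and every class prompt.

    patch_feats: (H, W, D) patch features (assumed to come from an image encoder).
    text_embeds: (C, D) one embedding per class name (assumed from a text encoder).
    Returns an (H, W, C) similarity ("cost") volume.
    """
    return l2_normalize(patch_feats) @ l2_normalize(text_embeds).T

def upsample_nearest(volume, factor):
    """Training-free upsampling by repeating each cell `factor` times per axis.

    A stand-in only: the paper's upsampler is not specified in the abstract.
    """
    return volume.repeat(factor, axis=0).repeat(factor, axis=1)

def segment(patch_feats, text_embeds, factor):
    """Per-pixel class index from the upsampled similarity volume."""
    sim = similarity_volume(patch_feats, text_embeds)
    return upsample_nearest(sim, factor).argmax(axis=-1)

# Toy data: 3 classes, 16-dim embeddings, a 4x4 patch grid.
rng = np.random.default_rng(0)
class_names = ["building", "road", "water"]  # illustrative labels only
text_embeds = rng.normal(size=(len(class_names), 16))
true_labels = rng.integers(0, len(class_names), size=(4, 4))
# Each patch feature is a slightly noisy copy of its class embedding.
patch_feats = text_embeds[true_labels] + 0.05 * rng.normal(size=(4, 4, 16))

mask = segment(patch_feats, text_embeds, factor=8)  # (32, 32) label map
```

Because the class list is just a set of text embeddings, changing the vocabulary only means swapping `text_embeds`; nothing in the segmentation step is tied to a fixed label set.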