ConInfer: Context-Aware Inference for Training-Free Open-Vocabulary Remote Sensing Segmentation

arXiv cs.CV / 4/1/2026

Key Points

  • ConInfer is introduced as a training-free open-vocabulary remote sensing segmentation framework that leverages vision-language models while improving how predictions are coordinated across the image.
  • The method targets a key limitation of prior approaches by moving beyond independent patch-wise inference to jointly predict multiple spatial units and explicitly model semantic dependencies between them.
  • By incorporating global contextual cues, ConInfer improves segmentation consistency, robustness, and generalization for large-scale remote sensing scenes with strong spatial and semantic correlations.
  • Experiments on multiple benchmark datasets show consistent gains over prior per-pixel VLM-based baselines (e.g., SegEarth-OV), with reported average improvements of 2.80% (open-vocabulary semantic segmentation) and 6.13% (object extraction).
  • The authors provide implementation code via a public GitHub repository, enabling replication and further experimentation.

Abstract

Training-free open-vocabulary remote sensing segmentation (OVRSS), empowered by vision-language models, has emerged as a promising paradigm for achieving category-agnostic semantic understanding in remote sensing imagery. Existing approaches mainly focus on enhancing feature representations or mitigating modality discrepancies to improve patch-level prediction accuracy. However, such independent prediction schemes are fundamentally misaligned with the intrinsic characteristics of remote sensing data. In real-world applications, remote sensing scenes are typically large-scale and exhibit strong spatial as well as semantic correlations, making isolated patch-wise predictions insufficient for accurate segmentation. To address this limitation, we propose ConInfer, a context-aware inference framework for OVRSS that performs joint prediction across multiple spatial units while explicitly modeling their inter-unit semantic dependencies. By incorporating global contextual cues, our method significantly enhances segmentation consistency, robustness, and generalization in complex remote sensing environments. Extensive experiments on multiple benchmark datasets demonstrate that our approach consistently surpasses state-of-the-art per-pixel VLM-based baselines such as SegEarth-OV, achieving average improvements of 2.80% and 6.13% on open-vocabulary semantic segmentation and object extraction tasks, respectively. The implementation code is available at: https://github.com/Dog-Yang/ConInfer
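The core contrast the abstract draws is between independent patch-wise prediction and joint, context-aware inference. The following numpy sketch is not ConInfer's actual algorithm (the paper's details live in the linked repository); it is a minimal, hypothetical illustration of the general idea, using random stand-ins for VLM patch and text embeddings and a simple blend with a scene-level score as the "global contextual cue".

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for VLM features: 6 patch embeddings and 3 class text embeddings.
# (In a real pipeline these would come from a vision-language model like CLIP.)
patches = rng.normal(size=(6, 8))
texts = rng.normal(size=(3, 8))

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Cosine similarity between every patch and every class prompt.
sims = normalize(patches) @ normalize(texts).T          # shape (6, 3)

# Independent patch-wise inference: each patch picks its best class in isolation.
independent = sims.argmax(axis=1)

# Context-aware inference (illustrative only): blend each patch's scores with
# the scene-level mean score, so spatially/semantically correlated patches are
# nudged toward consistent labels before the argmax.
alpha = 0.5
contextual = (alpha * sims + (1 - alpha) * sims.mean(axis=0)).argmax(axis=1)
```

The blending weight `alpha` and the global-mean context term are assumptions made for illustration; the paper's framework models inter-unit dependencies rather than a single scene-wide average.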
