ProVG: Progressive Visual Grounding via Language Decoupling for Remote Sensing Imagery
arXiv cs.CV / 4/3/2026
Key Points
- ProVG (Progressive Visual Grounding) targets remote sensing visual grounding, moving beyond holistic sentence-level alignment by exploiting fine-grained linguistic cues such as spatial relations and object attributes.
- The method decouples language into global context, spatial relations, and object attributes, then integrates them using a progressive cross-modal modulator with a survey–locate–verify (coarse-to-fine) attention scheme.
- To handle remote-sensing-specific challenges, ProVG adds cross-scale fusion for large-scale variability and a language-guided calibration decoder to refine alignment during prediction.
- It uses a unified multi-task head to support both referring expression comprehension and segmentation, and reports state-of-the-art results on RRSIS-D and RISBench.
- The work introduces a stage-aware way to use different linguistic components across the grounding pipeline, yielding consistent performance gains over prior approaches.
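The survey–locate–verify scheme described above can be pictured as a coarse-to-fine reweighting of region attention, where each stage applies the next, finer-grained language cue (global context, then spatial relations, then object attributes). The sketch below is an illustrative assumption about how such progressive modulation could work, not the paper's actual modulator: the multiplicative fusion, `progressive_ground` name, and plain dot-product scoring are all hypothetical simplifications.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def progressive_ground(visual, t_global, t_spatial, t_attr):
    """Toy survey->locate->verify grounding.

    visual:    list of region feature vectors
    t_global:  sentence-level (global context) embedding   -> survey
    t_spatial: spatial-relation cue embedding              -> locate
    t_attr:    object-attribute cue embedding              -> verify
    Returns a normalized attention distribution over regions.
    """
    # Survey: coarse attention over regions from the global sentence cue.
    attn = softmax([dot(v, t_global) for v in visual])
    # Locate, then verify: each stage multiplies in a finer cue's
    # attention map and renormalizes, progressively narrowing the focus.
    for cue in (t_spatial, t_attr):
        stage = softmax([dot(v, cue) for v in visual])
        attn = [a * s for a, s in zip(attn, stage)]
        z = sum(attn)
        attn = [a / z for a in attn]
    return attn
```

A region that scores well under all three cues keeps most of its mass, while regions matching only the coarse context are suppressed by the later stages.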