Fast-then-Fine: A Two-Stage Framework with Multi-Granular Representation for Cross-Modal Retrieval in Remote Sensing

arXiv cs.CV / 4/23/2026


Key Points

  • The article proposes a fast-then-fine (FTF) two-stage framework for remote sensing image-text retrieval, separating efficient candidate recall from fine-grained text-guided re-ranking.
  • In the recall stage, it uses text-agnostic, coarse-grained representations to quickly select candidate matches without relying on expensive cross-modal interaction.
  • In the re-ranking stage, it applies a parameter-free, balanced text-guided interaction block to improve fine-grained cross-modal alignment while avoiding additional learnable parameters.
  • It introduces both inter- and intra-modal losses to jointly optimize alignment across multiple granular representations, and reports strong benchmark results with improved retrieval efficiency over prior approaches.

Abstract

Remote sensing (RS) image-text retrieval plays a critical role in understanding massive RS imagery. However, the dense multi-object distribution and complex backgrounds in RS imagery make it difficult to simultaneously achieve fine-grained cross-modal alignment and efficient retrieval. Existing methods either rely on complex cross-modal interactions that lead to low retrieval efficiency, or depend on large-scale vision-language model pre-training, which requires massive data and computational resources. To address these issues, we propose a fast-then-fine (FTF) two-stage retrieval framework that decomposes retrieval into a text-agnostic recall stage for efficient candidate selection and a text-guided rerank stage for fine-grained alignment. Specifically, in the recall stage, text-agnostic coarse-grained representations are employed for efficient candidate selection; in the rerank stage, a parameter-free balanced text-guided interaction block enhances fine-grained alignment without introducing additional learnable parameters. Furthermore, an inter- and intra-modal loss is designed to jointly optimize cross-modal alignment across multi-granular representations. Extensive experiments on public benchmarks demonstrate that FTF achieves competitive retrieval accuracy while significantly improving retrieval efficiency compared with existing methods.
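The paper does not release code here, but the two-stage idea can be sketched with plain numpy. The sketch below is an assumption-laden illustration, not the authors' implementation: stage one ranks precomputed, text-agnostic global image embeddings by cosine similarity and keeps the top-k candidates; stage two re-scores each candidate with a parameter-free, text-guided interaction, here realized as softmax attention pooling of patch features using the text embedding as the query (the pooling choice, the function names `recall_stage`/`rerank_stage`, and all shapes are illustrative assumptions).

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project vectors onto the unit sphere so dot products equal cosine similarity."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-12)

def recall_stage(text_emb, image_embs, k):
    """Stage 1 (fast): text-agnostic recall.

    image_embs: (N, d) precomputed, L2-normalized global image embeddings.
    text_emb:   (d,)   L2-normalized text embedding.
    Returns the indices of the top-k candidates plus all similarities.
    """
    sims = image_embs @ text_emb          # cosine similarity, no cross-modal interaction
    topk = np.argsort(-sims)[:k]
    return topk, sims

def rerank_stage(text_emb, candidate_patch_feats):
    """Stage 2 (fine): parameter-free text-guided re-ranking (illustrative).

    For each candidate, attention-pool its patch features with the text embedding
    as the query, then re-score the pooled feature. No learnable parameters.
    """
    scores = []
    for patches in candidate_patch_feats:  # patches: (num_patches, d), L2-normalized
        attn = np.exp(patches @ text_emb)  # text-guided attention weights
        attn /= attn.sum()
        pooled = l2_normalize((attn[:, None] * patches).sum(axis=0))
        scores.append(float(pooled @ text_emb))
    return np.argsort(-np.array(scores))   # re-ranked order within the candidates
```

Because stage one touches only precomputed global embeddings, its cost is a single matrix-vector product over the whole gallery, while the more expensive patch-level interaction in stage two runs only on the k recalled candidates.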