Fast-then-Fine: A Two-Stage Framework with Multi-Granular Representation for Cross-Modal Retrieval in Remote Sensing
arXiv cs.CV / 4/23/2026
Key Points
- The article proposes a fast-then-fine (FTF) two-stage framework for remote sensing image-text retrieval, separating efficient candidate recall from fine-grained text-guided re-ranking.
- In the recall stage, it uses text-agnostic, coarse-grained representations to quickly select candidate matches without relying on expensive cross-modal interaction.
- In the re-ranking stage, it applies a parameter-free, balanced text-guided interaction block to improve fine-grained cross-modal alignment while avoiding additional learnable parameters.
- It introduces both inter-modal and intra-modal losses to jointly optimize alignment across representations at multiple granularities, and it reports strong benchmark results along with better retrieval efficiency than prior approaches.
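
The two stages described above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names (`recall_candidates`, `text_guided_score`, `fast_then_fine`), the `temperature` value, and the use of plain softmax attention as the parameter-free interaction are all assumptions made for illustration, and the paper's "balanced" block may differ. The sketch only shows the general pattern of cheap recall over global embeddings followed by text-guided re-ranking of the shortlist over local features.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v, eps=1e-12):
    n = math.sqrt(dot(v, v)) + eps
    return [x / n for x in v]

def recall_candidates(text_global, image_globals, k):
    """Stage 1 (fast): cosine similarity between one global text vector and
    global image vectors that do not depend on the query text."""
    q = normalize(text_global)
    sims = [dot(q, normalize(g)) for g in image_globals]
    order = sorted(range(len(sims)), key=lambda i: -sims[i])
    return order[:k]

def text_guided_score(text_tokens, image_patches, temperature=0.1):
    """Stage 2 (fine): each text token attends over image patches with a
    softmax that has no learnable weights (an assumed stand-in for the
    paper's interaction block). The score is the mean cosine similarity
    between each token and its attended patch summary."""
    patches = [normalize(p) for p in image_patches]
    total = 0.0
    for tok in text_tokens:
        t = normalize(tok)
        logits = [dot(t, p) / temperature for p in patches]
        m = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(l - m) for l in logits]
        z = sum(weights)
        attended = [sum(w * p[j] for w, p in zip(weights, patches)) / z
                    for j in range(len(t))]
        total += dot(t, normalize(attended))
    return total / len(text_tokens)

def fast_then_fine(text_global, text_tokens, image_globals, image_patches, k=5):
    """Recall k candidates cheaply, then re-rank only those k."""
    candidates = recall_candidates(text_global, image_globals, k)
    scored = [(text_guided_score(text_tokens, image_patches[i]), i)
              for i in candidates]
    scored.sort(key=lambda s: -s[0])
    return [(i, s) for s, i in scored]

# Toy data: 4 images with 2-d features (hypothetical values).
image_globals = [[1, 0], [0, 1], [1, 1], [-1, 0]]
image_patches = [
    [[0, 1], [0, 1]],   # image 0: global vector matches, patches do not
    [[0, 1]],
    [[1, 0], [0, 1]],   # image 2: one patch matches the text token
    [[1, 0]],
]
text_global = [1, 0.1]
text_tokens = [[1, 0]]

ranked = fast_then_fine(text_global, text_tokens,
                        image_globals, image_patches, k=2)
```

In this toy example, stage 1 shortlists images 0 and 2 in that order, and stage 2 moves image 2 ahead because one of its patches aligns with the text token. Because the stage-1 image vectors are text-agnostic, they can be computed once offline, so the more expensive token-to-patch interaction runs only on the `k` recalled candidates.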