Sketch and Text Synergy: Fusing Structural Contours and Descriptive Attributes for Fine-Grained Image Retrieval
arXiv cs.CV / 4/20/2026
📰 News · Models & Research
Key Points
- The paper addresses fine-grained image retrieval from either hand-drawn sketches or text by tackling the modality gap between structural contours (sketches) and appearance cues like color/texture (text).
- It proposes the Sketch and Text Based Image Retrieval (STBIR) framework that fuses sketch-derived structural outlines with text-provided color/texture information to improve retrieval accuracy.
- STBIR comprises three main technical components: a curriculum-learning robustness module that handles queries of varying quality, a category-knowledge-based feature space optimization module that strengthens representations, and a multi-stage cross-modal alignment mechanism that progressively reduces misalignment between modalities.
- The authors also build a fine-grained STBIR benchmark dataset and report extensive experiments showing STBIR significantly outperforms existing state-of-the-art methods.
- Overall, the work contributes both a new multimodal retrieval approach and a benchmark to support future research on sketch-and-text-based fine-grained image search.
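The core idea of combining sketch-derived structure with text-derived appearance cues can be illustrated with a toy example. The sketch below fuses two modality embeddings by a weighted sum and ranks gallery images by cosine similarity; all embeddings, the `fuse` function, and the `alpha` weight are illustrative assumptions, not the paper's actual architecture, which uses learned fusion and multi-stage alignment modules.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def fuse(sketch_emb, text_emb, alpha=0.5):
    # Weighted sum of the two modality embeddings -- a stand-in for
    # STBIR's learned fusion; alpha balances structure vs. appearance.
    return [alpha * s + (1 - alpha) * t for s, t in zip(sketch_emb, text_emb)]

# Toy embeddings (hypothetical values for illustration only).
sketch = [0.9, 0.1, 0.0]   # structural contour features from the sketch
text   = [0.1, 0.8, 0.2]   # color/texture attribute features from the text
gallery = {
    "img_a": [0.5, 0.5, 0.1],
    "img_b": [0.0, 0.1, 0.9],
}

query = fuse(sketch, text)
ranked = sorted(gallery, key=lambda k: cosine(query, gallery[k]), reverse=True)
print(ranked[0])  # the gallery image closest to the fused query
```

In a real system the embeddings would come from trained sketch, text, and image encoders, and the alignment modules would be applied during training rather than as a fixed weighted sum at query time.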