FIRE-CIR: Fine-grained Reasoning for Composed Fashion Image Retrieval

arXiv cs.CV / 4/13/2026


Key Points

  • FIRE-CIR tackles composed image retrieval by adding fine-grained compositional reasoning to decide which visual attributes to preserve versus modify from a text instruction, improving both accuracy and interpretability in fashion-specific settings.
  • Instead of relying only on embedding similarity, the model generates attribute-focused visual questions from the modification text and checks visual evidence across the reference and candidate images.
  • The approach is trained using a newly constructed large-scale fashion visual question answering dataset with questions that require single- or dual-image analysis.
  • On the Fashion IQ benchmark, FIRE-CIR achieves higher retrieval accuracy than state-of-the-art methods and provides attribute-level, explainable justifications for why particular candidates are re-ranked or filtered.

Abstract

Composed image retrieval (CIR) aims to retrieve a target image that depicts a reference image modified by a textual description. While recent vision-language models (VLMs) achieve promising CIR performance by embedding images and text into a shared space for retrieval, they often fail to reason about what to preserve and what to change. This limitation hinders interpretability and yields suboptimal results, particularly in fine-grained domains like fashion. In this paper, we introduce FIRE-CIR, a model that brings compositional reasoning and interpretability to fashion CIR. Instead of relying solely on embedding similarity, FIRE-CIR performs question-driven visual reasoning: it automatically generates attribute-focused visual questions derived from the modification text, and verifies the corresponding visual evidence in both reference and candidate images. To train such a reasoning system, we automatically construct a large-scale fashion-specific visual question answering dataset, containing questions requiring either single- or dual-image analysis. During retrieval, our model leverages this explicit reasoning to re-rank candidate results, filtering out images inconsistent with the intended modifications. Experimental results on the Fashion IQ benchmark show that FIRE-CIR outperforms state-of-the-art methods in retrieval accuracy. It also provides interpretable, attribute-level insights into retrieval decisions.
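The re-ranking idea described in the abstract can be sketched in a few lines: derive attribute-focused questions from the modification text, check each candidate against them with a VQA-style model, and re-order the embedding-based ranking by how many questions a candidate satisfies. The sketch below is purely illustrative and does not use the authors' actual models or API; the question generator and answerer are toy stand-ins, and candidate "images" are plain attribute dictionaries.

```python
# Hedged sketch of question-driven re-ranking, loosely following the pipeline
# described in the abstract. All function names and data are illustrative
# stand-ins, NOT the FIRE-CIR authors' actual implementation.

def generate_attribute_questions(modification_text):
    """Toy stand-in for a model that turns a modification instruction
    into attribute-focused yes/no questions about a candidate image."""
    questions = []
    if "blue" in modification_text:
        questions.append("Is the garment blue?")
    if "long sleeves" in modification_text:
        questions.append("Does the garment have long sleeves?")
    return questions

def answer_question(candidate_image, question):
    """Toy stand-in for a fashion VQA model; here the 'image' is just
    a dict of attributes, so answers are simple lookups."""
    if question == "Is the garment blue?":
        return candidate_image.get("color") == "blue"
    if question == "Does the garment have long sleeves?":
        return candidate_image.get("sleeves") == "long"
    return False

def rerank(candidates, modification_text):
    """Re-rank an initial (embedding-similarity) candidate list by the
    number of attribute questions each candidate satisfies; ties keep
    the original embedding-based order."""
    questions = generate_attribute_questions(modification_text)

    def score(item):
        rank, image = item
        passed = sum(answer_question(image, q) for q in questions)
        return (-passed, rank)  # more questions passed first, then original rank

    indexed = sorted(enumerate(candidates), key=score)
    return [image for _, image in indexed]

# Initial ranking from embedding similarity, as attribute dicts:
candidates = [
    {"id": "A", "color": "red", "sleeves": "long"},
    {"id": "B", "color": "blue", "sleeves": "short"},
    {"id": "C", "color": "blue", "sleeves": "long"},
]
reranked = rerank(candidates, "change the dress to blue with long sleeves")
print([c["id"] for c in reranked])  # → ['C', 'A', 'B']
```

Candidate C satisfies both generated questions and moves to the top, while A and B (one question each) keep their original relative order, mirroring how explicit attribute checks can correct an embedding-only ranking.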