MCoT-MVS: Multi-level Vision Selection by Multi-modal Chain-of-Thought Reasoning for Composed Image Retrieval
arXiv cs.CV / March 19, 2026
📰 News · Models & Research
Key Points
- The paper introduces MCoT-MVS, a multi-level vision selection framework for Composed Image Retrieval (CIR) that leverages multi-modal chain-of-thought reasoning from a large language model to guide vision-text understanding.
- The reasoning cues are used to generate three texts — describing what is retained, what is removed, and the inferred target — which in turn guide two reference visual attention modules to extract discriminative patch-level and instance-level semantics from the reference image.
- A weighted hierarchical fusion module then combines these multi-granular visual cues with the modified text and imagined target description to align the query with target images in a unified embedding space.
- The method achieves state-of-the-art results on CIRR and FashionIQ benchmarks, and the authors publicly release code and trained models.
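The final fusion-and-retrieval step described above can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the embedding dimension, the softmax-normalized fusion weights, and all function names are assumptions made for clarity.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project vectors onto the unit sphere (standard for cosine retrieval)."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-8)

def fuse_query(patch_emb, instance_emb, text_emb, target_desc_emb, weights):
    """Weighted hierarchical fusion (illustrative): combine patch-level and
    instance-level visual cues with the modified text and the imagined
    target description via softmax-normalized weights."""
    w = np.exp(weights - weights.max())
    w = w / w.sum()
    fused = (w[0] * patch_emb + w[1] * instance_emb
             + w[2] * text_emb + w[3] * target_desc_emb)
    return l2_normalize(fused)

def retrieve(query_emb, gallery_embs):
    """Rank gallery images by cosine similarity to the fused query in the
    unified embedding space; returns indices, best match first."""
    sims = l2_normalize(gallery_embs) @ query_emb
    return np.argsort(-sims)
```

In practice the four input embeddings would come from the vision and text encoders, and the fusion weights would be learned end to end; here they are plain vectors so the ranking logic is easy to follow.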