MCoT-MVS: Multi-level Vision Selection by Multi-modal Chain-of-Thought Reasoning for Composed Image Retrieval
arXiv cs.CV / March 19, 2026
📰 News · Models & Research
Key Points
- The paper introduces MCoT-MVS, a multi-level vision selection framework for Composed Image Retrieval (CIR) that leverages multi-modal chain-of-thought reasoning from a large language model to guide vision-text understanding.
- The reasoning cues are used to generate three texts — describing what is retained, what is removed, and the inferred target — which in turn guide two reference visual attention modules to extract discriminative patch-level and instance-level semantics from the reference image.
- A weighted hierarchical fusion module then combines these multi-granular visual cues with the modified text and imagined target description to align the query with target images in a unified embedding space.
- The method achieves state-of-the-art results on CIRR and FashionIQ benchmarks, and the authors publicly release code and trained models.
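The final fusion-and-retrieval step described above can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the embedding dimension, the softmax-normalized fusion weights, and all function names are assumptions made for clarity.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project vectors onto the unit sphere (standard for cosine retrieval)."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-8)

def fuse_query(patch_emb, instance_emb, text_emb, target_desc_emb, weights):
    """Weighted hierarchical fusion (illustrative): combine patch-level and
    instance-level visual cues with the modified text and the imagined
    target description via softmax-normalized weights."""
    w = np.exp(weights - weights.max())
    w = w / w.sum()
    fused = (w[0] * patch_emb + w[1] * instance_emb
             + w[2] * text_emb + w[3] * target_desc_emb)
    return l2_normalize(fused)

def retrieve(query_emb, gallery_embs):
    """Rank gallery images by cosine similarity to the fused query in the
    unified embedding space; returns indices, best match first."""
    sims = l2_normalize(gallery_embs) @ query_emb
    return np.argsort(-sims)
```

In practice the four input embeddings would come from the vision and text encoders, and the fusion weights would be learned end to end; here they are plain vectors so the ranking logic is easy to follow.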