UniCVR: From Alignment to Reranking for Unified Zero-Shot Composed Visual Retrieval

arXiv cs.CV · April 23, 2026

📰 News · Models & Research

Key Points

  • The paper introduces UniCVR, the first unified zero-shot framework for composed image retrieval, multi-turn composed image retrieval, and composed video retrieval, using a single paradigm across all tasks.
  • UniCVR combines Multimodal Large Language Models (MLLMs) for compositional query understanding with Vision-Language Pre-trained (VLP) encoders for structured visual retrieval, without relying on task-specific human-annotated data.
  • In Stage I, it trains the MLLM as a compositional query embedder via contrastive learning on a large (~3.5M) multi-source dataset, including cluster-based hard negative sampling to improve supervision.
  • In Stage II, it applies an MLLM-guided dual-level reranking strategy that uses adaptive, budgeted subset scoring over top candidates to improve ranking accuracy with low added computation.
  • Experiments on five benchmarks covering all three tasks show UniCVR achieves state-of-the-art performance and strong generalization; the authors plan to release data and code after acceptance.
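The Stage I recipe above (contrastive training with cluster-based hard negatives) can be sketched as follows. Since the paper's code is not yet released, everything here is an illustrative assumption: the k-means clustering step, the function names, and the InfoNCE loss form are stand-ins for whatever the authors actually use.

```python
import numpy as np

def kmeans(embs, k, iters=10, seed=0):
    """Plain k-means over target embeddings (a hypothetical clustering step)."""
    rng = np.random.default_rng(seed)
    centers = embs[rng.choice(len(embs), k, replace=False)]
    for _ in range(iters):
        dists = ((embs[:, None] - centers[None]) ** 2).sum(-1)
        labels = dists.argmin(1)
        for c in range(k):
            members = embs[labels == c]
            if len(members):
                centers[c] = members.mean(0)
    return labels

def sample_hard_negatives(target_embs, pos_idx, labels, n_neg, rng):
    """Draw negatives from the positive's own cluster (semantically close,
    hence 'hard'); fall back to random negatives if the cluster is too small."""
    same = np.where(labels == labels[pos_idx])[0]
    same = same[same != pos_idx]
    if len(same) < n_neg:
        pool = np.setdiff1d(np.arange(len(target_embs)), [pos_idx])
        return rng.choice(pool, n_neg, replace=False)
    return rng.choice(same, n_neg, replace=False)

def info_nce(query, pos, negs, temperature=0.07):
    """InfoNCE loss for one query against its positive and hard negatives."""
    q = query / np.linalg.norm(query)
    cands = np.vstack([pos, negs])
    cands = cands / np.linalg.norm(cands, axis=1, keepdims=True)
    logits = cands @ q / temperature
    return -logits[0] + np.log(np.exp(logits).sum())
```

The design intuition is standard for contrastive retrieval: negatives sampled from the same cluster as the target are harder to distinguish than random gallery items, so they yield stronger gradient signal for the query embedder.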

Abstract

Composed image retrieval, multi-turn composed image retrieval, and composed video retrieval all share a common paradigm: composing the reference visual with modification text to retrieve the desired target. Despite this shared structure, the three tasks have been studied in isolation, with no prior work proposing a unified framework, let alone a zero-shot solution. In this paper, we propose UniCVR, the first unified zero-shot composed visual retrieval framework that jointly addresses all three tasks without any task-specific human-annotated data. UniCVR strategically combines two complementary strengths: Multimodal Large Language Models (MLLMs) for compositional query understanding and Vision-Language Pre-trained (VLP) models for structured visual retrieval. Concretely, UniCVR operates in two stages. In Stage I, we train the MLLM as a compositional query embedder via contrastive learning on a curated multi-source dataset of approximately 3.5M samples, bridging the heterogeneous embedding spaces between the MLLM and the frozen VLP gallery encoder. A cluster-based hard negative sampling strategy is proposed to strengthen contrastive supervision. In Stage II, we introduce an MLLM-guided dual-level reranking mechanism that applies adaptive budgeted subset scoring to a small number of top-ranked candidates, and then exploits the resulting relevance signals through a dual-level re-scoring scheme, producing more accurate final rankings with minimal computational overhead. Extensive experiments across five benchmarks covering all three tasks demonstrate that UniCVR achieves cutting-edge performance, validating its effectiveness and generalizability. Our data and code will be released upon acceptance.