UniCVR: From Alignment to Reranking for Unified Zero-Shot Composed Visual Retrieval

arXiv cs.CV · April 23, 2026

📰 News · Models & Research

Key Points

  • The paper introduces UniCVR, the first unified zero-shot framework for composed image retrieval, multi-turn composed image retrieval, and composed video retrieval, using a single paradigm across all tasks.
  • UniCVR combines Multimodal Large Language Models (MLLMs) for compositional query understanding with Vision-Language Pre-trained (VLP) encoders for structured visual retrieval, without relying on task-specific human-annotated data.
  • In Stage I, it trains the MLLM as a compositional query embedder via contrastive learning on a large (~3.5M) multi-source dataset, including cluster-based hard negative sampling to improve supervision.
  • In Stage II, it applies an MLLM-guided dual-level reranking strategy that uses adaptive, budgeted subset scoring over top candidates to improve ranking accuracy with low added computation.
  • Experiments on five benchmarks covering all three tasks show UniCVR achieves state-of-the-art performance and strong generalization; the authors plan to release data and code after acceptance.
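The Stage I recipe above (contrastive training with cluster-based hard negatives) can be sketched as follows. Since the paper's code is not yet released, everything here is an illustrative assumption: the k-means clustering step, the function names, and the InfoNCE loss form are stand-ins for whatever the authors actually use.

```python
import numpy as np

def kmeans(embs, k, iters=10, seed=0):
    """Plain k-means over target embeddings (a hypothetical clustering step)."""
    rng = np.random.default_rng(seed)
    centers = embs[rng.choice(len(embs), k, replace=False)]
    for _ in range(iters):
        dists = ((embs[:, None] - centers[None]) ** 2).sum(-1)
        labels = dists.argmin(1)
        for c in range(k):
            members = embs[labels == c]
            if len(members):
                centers[c] = members.mean(0)
    return labels

def sample_hard_negatives(target_embs, pos_idx, labels, n_neg, rng):
    """Draw negatives from the positive's own cluster (semantically close,
    hence 'hard'); fall back to random negatives if the cluster is too small."""
    same = np.where(labels == labels[pos_idx])[0]
    same = same[same != pos_idx]
    if len(same) < n_neg:
        pool = np.setdiff1d(np.arange(len(target_embs)), [pos_idx])
        return rng.choice(pool, n_neg, replace=False)
    return rng.choice(same, n_neg, replace=False)

def info_nce(query, pos, negs, temperature=0.07):
    """InfoNCE loss for one query against its positive and hard negatives."""
    q = query / np.linalg.norm(query)
    cands = np.vstack([pos, negs])
    cands = cands / np.linalg.norm(cands, axis=1, keepdims=True)
    logits = cands @ q / temperature
    return -logits[0] + np.log(np.exp(logits).sum())
```

The design intuition is standard for contrastive retrieval: negatives sampled from the same cluster as the target are harder to distinguish than random gallery items, so they yield stronger gradient signal for the query embedder.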

Abstract

Composed image retrieval, multi-turn composed image retrieval, and composed video retrieval all share a common paradigm: composing the reference visual with modification text to retrieve the desired target. Despite this shared structure, the three tasks have been studied in isolation, with no prior work proposing a unified framework, let alone a zero-shot solution. In this paper, we propose UniCVR, the first unified zero-shot composed visual retrieval framework that jointly addresses all three tasks without any task-specific human-annotated data. UniCVR strategically combines two complementary strengths: Multimodal Large Language Models (MLLMs) for compositional query understanding and Vision-Language Pre-trained (VLP) models for structured visual retrieval. Concretely, UniCVR operates in two stages. In Stage I, we train the MLLM as a compositional query embedder via contrastive learning on a curated multi-source dataset of approximately 3.5M samples, bridging the heterogeneous embedding spaces between the MLLM and the frozen VLP gallery encoder. A cluster-based hard negative sampling strategy is proposed to strengthen contrastive supervision. In Stage II, we introduce an MLLM-guided dual-level reranking mechanism that applies adaptive budgeted subset scoring to a small number of top-ranked candidates, and then exploits the resulting relevance signals through a dual-level re-scoring scheme, producing more accurate final rankings with minimal computational overhead. Extensive experiments across five benchmarks covering all three tasks demonstrate that UniCVR achieves cutting-edge performance, validating its effectiveness and generalizability. Our data and code will be released upon acceptance.