UniCVR: From Alignment to Reranking for Unified Zero-Shot Composed Visual Retrieval
arXiv cs.CV / 4/23/2026
Key Points
- The paper introduces UniCVR, presented as the first zero-shot framework to unify composed image retrieval, multi-turn composed image retrieval, and composed video retrieval under a single paradigm.
- UniCVR combines Multimodal Large Language Models (MLLMs) for compositional query understanding with Vision-Language Pre-trained (VLP) encoders for structured visual retrieval, without relying on task-specific human-annotated data.
- In Stage I, it trains the MLLM as a compositional query embedder via contrastive learning on a large (~3.5M) multi-source dataset, using cluster-based hard-negative sampling to strengthen the contrastive supervision.
- In Stage II, it applies an MLLM-guided dual-level reranking strategy: adaptive, budgeted subset scoring over the top retrieved candidates improves ranking accuracy at low added compute.
- Experiments on five benchmarks covering all three tasks show UniCVR achieves state-of-the-art performance and strong generalization; the authors plan to release data and code after acceptance.
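The summary does not give the paper's exact Stage I objective, but contrastive training with per-query hard negatives is commonly an InfoNCE loss whose logit matrix appends hard-negative similarities to the in-batch ones. A minimal numpy sketch under that assumption (function names, the temperature value, and the loss form are illustrative, not taken from the paper):

```python
import numpy as np

def l2_normalize(x):
    """Normalize vectors to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def logsumexp(x, axis):
    """Numerically stable log-sum-exp along one axis."""
    m = x.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def infonce_with_hard_negatives(q, pos, hard_neg, tau=0.07):
    """InfoNCE over in-batch negatives plus per-query hard negatives.

    q        : (B, D) composed-query embeddings (reference image + edit text)
    pos      : (B, D) target embeddings; pos[i] is the positive for q[i]
    hard_neg : (B, K, D) hard negatives, e.g. mined from the query's cluster
    """
    q, pos, hard_neg = l2_normalize(q), l2_normalize(pos), l2_normalize(hard_neg)
    sim_batch = (q @ pos.T) / tau                           # (B, B); diagonal = positives
    sim_hard = np.einsum('bd,bkd->bk', q, hard_neg) / tau   # (B, K) extra hard-negative logits
    logits = np.concatenate([sim_batch, sim_hard], axis=1)  # (B, B+K)
    # Cross-entropy with the positive logit at column i for row i.
    log_prob_pos = logits[np.arange(len(q)), np.arange(len(q))] - logsumexp(logits, axis=1)
    return float(-log_prob_pos.mean())
```

Cluster-based mining would pick the `hard_neg` rows from items in the same embedding cluster as each query, so the negatives are semantically close and the gradient signal is sharper than with random negatives.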
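The Stage II bullet describes budgeted reranking at a high level only. One plausible shape, sketched here as an assumption rather than the paper's actual dual-level procedure: spend the expensive MLLM scorer only on the top-`budget` candidates from the fast retrieval pass, then fuse both scores (the linear fusion weight `alpha` and the `expensive_score` callback are stand-ins):

```python
def budgeted_rerank(candidates, fast_scores, expensive_score, budget, alpha=0.5):
    """Rerank candidates under a scoring budget.

    candidates      : list of retrieved items from the fast (VLP-encoder) pass
    fast_scores     : parallel list of fast retrieval scores
    expensive_score : callable standing in for MLLM-based scoring (assumed API)
    budget          : how many top candidates get the expensive score
    alpha           : fusion weight between fast and expensive scores

    Returns candidate indices sorted best-first by the fused score.
    """
    order = sorted(range(len(candidates)), key=lambda i: -fast_scores[i])
    fused = list(fast_scores)
    for i in order[:budget]:  # only the budgeted subset pays the MLLM cost
        fused[i] = alpha * fast_scores[i] + (1 - alpha) * expensive_score(candidates[i])
    return sorted(range(len(candidates)), key=lambda i: -fused[i])
```

The point of the budget is that reranking cost grows with `budget`, not with the candidate pool, so the expensive model can correct the head of the ranking without rescoring everything.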