Adapting MLLMs for Nuanced Video Retrieval

arXiv cs.CV / 4/27/2026

💬 Opinion · Models & Research

Key Points

  • The paper proposes a unified embedding model for nuanced video retrieval that explicitly addresses temporal nuance, negation in queries, and multimodal/composed retrieval scenarios.
  • It repurposes an existing Multimodal Large Language Model (MLLM) originally trained for text generation into an embedding model, then fine-tunes it using contrastive learning.
  • The approach uses carefully sampled hard negatives and a contrastive loss to force the embedding space to encode the desired distinctions for temporal opposites and query negators (a minimal sketch follows this list).
  • Even though training is performed only on text, the method reports state-of-the-art results across nuanced video retrieval benchmarks and attributes the gains to a reduced modality gap between text and video embeddings.
  • The authors include an analysis explaining how the text-only training improves embedding organization and how this helps retrieval performance under the targeted nuances.
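
As a concrete illustration of the training bullet above, here is a minimal PyTorch sketch of an InfoNCE-style contrastive loss extended with one explicitly sampled hard negative per query (e.g., the temporally opposite or negated caption). The function name, temperature, and batch layout are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def contrastive_loss_with_hard_negatives(q, pos, hard_neg, tau=0.05):
    """InfoNCE over a batch of L2-normalized (B, D) embeddings.

    Each query is pulled toward its own positive caption and pushed away
    from the other in-batch positives plus one dedicated hard negative,
    e.g. "closing a door" as the negative for "opening a door".
    """
    logits_pos = q @ pos.t()                              # (B, B)
    logits_hard = (q * hard_neg).sum(-1, keepdim=True)    # (B, 1)
    logits = torch.cat([logits_pos, logits_hard], dim=1) / tau
    targets = torch.arange(q.size(0), device=q.device)    # diagonal match
    return F.cross_entropy(logits, targets)

# Toy usage with random embeddings:
B, D = 8, 256
q = F.normalize(torch.randn(B, D), dim=-1)
pos = F.normalize(torch.randn(B, D), dim=-1)
hard_neg = F.normalize(torch.randn(B, D), dim=-1)
print(contrastive_loss_with_hard_negatives(q, pos, hard_neg).item())
```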

Abstract

Our objective is to build an embedding model that captures the nuanced relationship between a search query and candidate videos. We cover three aspects of nuanced retrieval: (i) temporal, (ii) negation, and (iii) multimodal. For temporal nuance, we consider chiral actions, which require distinguishing between temporally opposite actions such as "opening a door" vs. "closing a door". For negation, we consider queries with negators such as "not" and "none" that allow the user to specify what they do not want. For multimodal nuance, we consider the task of composed retrieval, where the query comprises a video along with a text edit instruction. The goal is to develop a unified embedding model that handles such nuances effectively. To that end, we repurpose a Multimodal Large Language Model (MLLM) trained to generate text into an embedding model. We fine-tune it with a contrastive loss on text alone, with carefully sampled hard negatives that instill the desired nuances in the learned embedding space. Despite the text-only training, our method achieves state-of-the-art performance on all benchmarks for nuanced video retrieval. We also analyze how this improvement is achieved, and show that text-only training reduces the modality gap between text and video embeddings, leading to better organization of the embedding space.
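
The abstract's key move, turning a generative MLLM into an embedding model, is commonly done by pooling a hidden state from the backbone. The sketch below shows one such recipe (last-token pooling with Hugging Face transformers); the backbone name, pooling choice, and the cosine-similarity probe are assumptions for illustration, not the paper's published setup.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Placeholder text-only backbone; the paper repurposes a full MLLM.
model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
tok.pad_token = tok.eos_token  # gpt2 has no pad token by default
model = AutoModel.from_pretrained(model_name).eval()

@torch.no_grad()
def embed_text(texts):
    batch = tok(texts, return_tensors="pt", padding=True, truncation=True)
    hidden = model(**batch).last_hidden_state            # (B, T, D)
    # Pool the hidden state of the last non-padding token per sequence.
    last = batch["attention_mask"].sum(dim=1) - 1        # (B,)
    pooled = hidden[torch.arange(hidden.size(0)), last]  # (B, D)
    return F.normalize(pooled, dim=-1)

# After contrastive fine-tuning, temporally opposite captions should land
# far apart; with a raw backbone they will typically be near-identical.
e = embed_text(["opening a door", "closing a door"])
print("cosine similarity:", (e[0] @ e[1]).item())
```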