Adapting MLLMs for Nuanced Video Retrieval

arXiv cs.CV / 4/27/2026

💬 Opinion · Models & Research

Key Points

  • The paper proposes a unified embedding model for nuanced video retrieval that explicitly addresses temporal nuance, negation in queries, and multimodal/composed retrieval scenarios.
  • It repurposes an existing Multimodal Large Language Model (MLLM) originally trained for text generation into an embedding model, then fine-tunes it using contrastive learning.
  • The approach uses carefully sampled hard negatives and a contrastive loss to force the embedding space to encode the desired distinctions for temporal opposites and query negators (a minimal sketch follows this list).
  • Even though training is performed only on text, the method reports state-of-the-art results across nuanced video retrieval benchmarks and attributes the gains to a reduced modality gap between text and video embeddings.
  • The authors include an analysis explaining how the text-only training improves embedding organization and how this helps retrieval performance under the targeted nuances.
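
As a concrete illustration of the training bullet above, here is a minimal PyTorch sketch of an InfoNCE-style contrastive loss extended with one explicitly sampled hard negative per query (e.g., the temporally opposite or negated caption). The function name, temperature, and batch layout are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def contrastive_loss_with_hard_negatives(q, pos, hard_neg, tau=0.05):
    """InfoNCE over a batch of L2-normalized (B, D) embeddings.

    Each query is pulled toward its own positive caption and pushed away
    from the other in-batch positives plus one dedicated hard negative,
    e.g. "closing a door" as the negative for "opening a door".
    """
    logits_pos = q @ pos.t()                              # (B, B)
    logits_hard = (q * hard_neg).sum(-1, keepdim=True)    # (B, 1)
    logits = torch.cat([logits_pos, logits_hard], dim=1) / tau
    targets = torch.arange(q.size(0), device=q.device)    # diagonal match
    return F.cross_entropy(logits, targets)

# Toy usage with random embeddings:
B, D = 8, 256
q = F.normalize(torch.randn(B, D), dim=-1)
pos = F.normalize(torch.randn(B, D), dim=-1)
hard_neg = F.normalize(torch.randn(B, D), dim=-1)
print(contrastive_loss_with_hard_negatives(q, pos, hard_neg).item())
```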

Abstract

Our objective is to build an embedding model that captures the nuanced relationship between a search query and candidate videos. We cover three aspects of nuanced retrieval: (i) temporal, (ii) negation, and (iii) multimodal. For temporal nuance, we consider chiral actions, which require distinguishing between temporally opposite actions such as "opening a door" vs. "closing a door". For negation, we consider queries with negators such as "not" and "none" that allow the user to specify what they do not want. For multimodal nuance, we consider the task of composed retrieval, where the query comprises a video along with a text edit instruction. The goal is to develop a unified embedding model that handles such nuances effectively. To that end, we repurpose a Multimodal Large Language Model (MLLM) trained to generate text into an embedding model. We fine-tune it with a contrastive loss on text alone, with carefully sampled hard negatives that instill the desired nuances in the learned embedding space. Despite the text-only training, our method achieves state-of-the-art performance on all benchmarks for nuanced video retrieval. We also analyze how this improvement is achieved, and show that text-only training reduces the modality gap between text and video embeddings, leading to better organization of the embedding space.
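
The abstract's key move, turning a generative MLLM into an embedding model, is commonly done by pooling a hidden state from the backbone. The sketch below shows one such recipe (last-token pooling with Hugging Face transformers); the backbone name, pooling choice, and the cosine-similarity probe are assumptions for illustration, not the paper's published setup.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Placeholder text-only backbone; the paper repurposes a full MLLM.
model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
tok.pad_token = tok.eos_token  # gpt2 has no pad token by default
model = AutoModel.from_pretrained(model_name).eval()

@torch.no_grad()
def embed_text(texts):
    batch = tok(texts, return_tensors="pt", padding=True, truncation=True)
    hidden = model(**batch).last_hidden_state            # (B, T, D)
    # Pool the hidden state of the last non-padding token per sequence.
    last = batch["attention_mask"].sum(dim=1) - 1        # (B,)
    pooled = hidden[torch.arange(hidden.size(0)), last]  # (B, D)
    return F.normalize(pooled, dim=-1)

# After contrastive fine-tuning, temporally opposite captions should land
# far apart; with a raw backbone they will typically be near-identical.
e = embed_text(["opening a door", "closing a door"])
print("cosine similarity:", (e[0] @ e[1]).item())
```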