ViLL-E: Video LLM Embeddings for Retrieval

arXiv cs.CV / 4/15/2026


Key Points

  • ViLL-E (Video-LLM-Embed) is a unified architecture that targets better performance on embedding (retrieval) tasks, such as text-to-video retrieval and moment retrieval, in addition to the text-output tasks at which VideoLLMs already excel.
  • A key feature is an embedding generation mechanism that lets the model "think longer" on complex videos and "stop early" on easy ones.
  • Training combines generative and contrastive learning across three stages: (1) large-scale pre-training on video-caption pairs, (2) continual training on a smaller dataset of detailed captions, and (3) task-specific fine-tuning on a multi-task dataset covering Video QA, Temporal Localization, Video Retrieval, and Video-Text Matching.
  • The paper reports an average 7% improvement on temporal localization, up to a 4% improvement over dual-encoder models on video retrieval, and results that surpass SotA on zero-shot composed video retrieval and retrieval from long text queries.
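The "think longer / stop early" idea in the second point can be pictured as an early-exit loop: keep refining an embedding and stop as soon as one more step barely changes it. The sketch below is purely illustrative; the refinement rule, the convergence test, and all names (`refine_step`, `embed_with_early_stop`, `eps`) are assumptions, not the paper's actual mechanism.

```python
import math

def refine_step(embedding, features):
    # Hypothetical refinement: nudge the embedding halfway toward the
    # mean of the frame features (a stand-in for one more "thinking" step).
    mean = [sum(col) / len(features) for col in zip(*features)]
    return [0.5 * e + 0.5 * m for e, m in zip(embedding, mean)]

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def embed_with_early_stop(features, max_steps=10, eps=1e-3):
    """Return an embedding, iterating longer for 'hard' inputs.

    Stops as soon as one refinement step changes the embedding by less
    than `eps` (easy video); otherwise runs up to `max_steps` (complex video).
    """
    emb = features[0][:]  # crude initialization from the first frame
    for step in range(1, max_steps + 1):
        new_emb = refine_step(emb, features)
        if l2(new_emb, emb) < eps:
            return new_emb, step
        emb = new_emb
    return emb, max_steps

# A near-uniform (easy) clip converges quickly; a varied (hard) one does not.
easy = [[1.0, 0.0], [1.0, 0.01]]
hard = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
_, easy_steps = embed_with_early_stop(easy)
_, hard_steps = embed_with_early_stop(hard)
```

The point of the loop is that compute scales with input difficulty: the near-duplicate-frame clip exits after a few steps, while the clip with diverse frames uses the full budget.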

Abstract

Video Large Language Models (VideoLLMs) excel at video understanding tasks where outputs are textual, such as Video Question Answering and Video Captioning. However, they underperform specialized embedding-based models in Retrieval tasks, such as Text-to-Video Retrieval and Moment Retrieval. We introduce ViLL-E (Video-LLM-Embed), a unified VideoLLM architecture endowed with a novel embedding generation mechanism that allows the model to "think longer" for complex videos and stop early for easy ones. We train this model with a three-stage training methodology combining generative and contrastive learning: initial large-scale pre-training with video-caption pairs; followed by continual training on a smaller, detailed-caption dataset; and concluding with task-specific fine-tuning on a novel multi-task dataset covering Video QA, Temporal Localization, Video Retrieval, and Video-Text Matching. Our model significantly improves temporal localization (on avg. 7% over other VideoLLMs) and video retrieval (up to 4% over dual encoder models), achieving performance comparable to state-of-the-art specialized embedding models while remaining competitive on VideoQA tasks. Furthermore, our joint contrastive-generative training unlocks new zero-shot capabilities, significantly outperforming state-of-the-art methods in composed video retrieval (+5% over SotA) and retrieval from long text (+2% over SotA).
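The joint contrastive-generative training the abstract describes can be sketched as a weighted sum of two losses: a symmetric InfoNCE term over matched video/text embedding pairs, plus a next-token negative log-likelihood term for the generative side. Everything below is a minimal illustration, not the paper's implementation; the temperature, the mixing weight `alpha`, and the function names are assumptions.

```python
import math

def info_nce(video_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss over matched video/text embedding pairs."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)

    n = len(video_embs)
    # Temperature-scaled cosine similarity matrix; diagonal = positives.
    sims = [[cos(v, t) / temperature for t in text_embs] for v in video_embs]
    loss = 0.0
    for i in range(n):
        row = sims[i]                       # video -> text direction
        loss += -row[i] + math.log(sum(math.exp(s) for s in row))
        col = [sims[j][i] for j in range(n)]  # text -> video direction
        loss += -col[i] + math.log(sum(math.exp(s) for s in col))
    return loss / (2 * n)

def generative_loss(token_probs):
    """Mean negative log-likelihood of the gold next tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def joint_loss(video_embs, text_embs, token_probs, alpha=0.5):
    # alpha is an illustrative mixing weight, not a value from the paper.
    return (alpha * info_nce(video_embs, text_embs)
            + (1 - alpha) * generative_loss(token_probs))
```

With orthogonal embeddings for distinct pairs and confident token predictions, both terms approach zero, e.g. `joint_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]], [0.9, 0.95])` is small; the contrastive term pulls matched video/text pairs together in embedding space while the generative term preserves text-output quality.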