ViLL-E: Video LLM Embeddings for Retrieval

arXiv cs.CV / 4/15/2026


Key Points

  • ViLL-E (Video-LLM-Embed) is a unified architecture that targets better performance on embedding (retrieval) tasks, such as text-to-video retrieval and moment retrieval, in addition to the text-output tasks at which VideoLLMs already excel.
  • A key feature is an embedding generation mechanism that lets the model "think longer" on complex videos and "stop early" on easy ones.
  • Training combines generative and contrastive learning across three stages: (1) large-scale pre-training on video-caption pairs, (2) continual training on a smaller dataset of detailed captions, and (3) task-specific fine-tuning on a multi-task dataset covering Video QA, Temporal Localization, Video Retrieval, and Video-Text Matching.
  • The paper reports an average 7% improvement on temporal localization, up to a 4% improvement over dual-encoder models on video retrieval, and results that surpass SotA on zero-shot composed video retrieval and retrieval from long text queries.
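The "think longer / stop early" idea in the second point can be pictured as an early-exit loop: keep refining an embedding and stop as soon as one more step barely changes it. The sketch below is purely illustrative; the refinement rule, the convergence test, and all names (`refine_step`, `embed_with_early_stop`, `eps`) are assumptions, not the paper's actual mechanism.

```python
import math

def refine_step(embedding, features):
    # Hypothetical refinement: nudge the embedding halfway toward the
    # mean of the frame features (a stand-in for one more "thinking" step).
    mean = [sum(col) / len(features) for col in zip(*features)]
    return [0.5 * e + 0.5 * m for e, m in zip(embedding, mean)]

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def embed_with_early_stop(features, max_steps=10, eps=1e-3):
    """Return an embedding, iterating longer for 'hard' inputs.

    Stops as soon as one refinement step changes the embedding by less
    than `eps` (easy video); otherwise runs up to `max_steps` (complex video).
    """
    emb = features[0][:]  # crude initialization from the first frame
    for step in range(1, max_steps + 1):
        new_emb = refine_step(emb, features)
        if l2(new_emb, emb) < eps:
            return new_emb, step
        emb = new_emb
    return emb, max_steps

# A near-uniform (easy) clip converges quickly; a varied (hard) one does not.
easy = [[1.0, 0.0], [1.0, 0.01]]
hard = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
_, easy_steps = embed_with_early_stop(easy)
_, hard_steps = embed_with_early_stop(hard)
```

The point of the loop is that compute scales with input difficulty: the near-duplicate-frame clip exits after a few steps, while the clip with diverse frames uses the full budget.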

Abstract

Video Large Language Models (VideoLLMs) excel at video understanding tasks where outputs are textual, such as Video Question Answering and Video Captioning. However, they underperform specialized embedding-based models in Retrieval tasks, such as Text-to-Video Retrieval and Moment Retrieval. We introduce ViLL-E (Video-LLM-Embed), a unified VideoLLM architecture endowed with a novel embedding generation mechanism that allows the model to "think longer" for complex videos and stop early for easy ones. We train this model with a three-stage training methodology combining generative and contrastive learning: initial large-scale pre-training with video-caption pairs; followed by continual training on a smaller, detailed-caption dataset; and concluding with task-specific fine-tuning on a novel multi-task dataset covering Video QA, Temporal Localization, Video Retrieval, and Video-Text Matching. Our model significantly improves temporal localization (on avg. 7% over other VideoLLMs) and video retrieval (up to 4% over dual encoder models), achieving performance comparable to state-of-the-art specialized embedding models while remaining competitive on VideoQA tasks. Furthermore, our joint contrastive-generative training unlocks new zero-shot capabilities, significantly outperforming state-of-the-art methods in composed video retrieval (+5% over SotA) and retrieval from long text (+2% over SotA).
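The joint contrastive-generative training the abstract describes can be sketched as a weighted sum of two losses: a symmetric InfoNCE term over matched video/text embedding pairs, plus a next-token negative log-likelihood term for the generative side. Everything below is a minimal illustration, not the paper's implementation; the temperature, the mixing weight `alpha`, and the function names are assumptions.

```python
import math

def info_nce(video_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss over matched video/text embedding pairs."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)

    n = len(video_embs)
    # Temperature-scaled cosine similarity matrix; diagonal = positives.
    sims = [[cos(v, t) / temperature for t in text_embs] for v in video_embs]
    loss = 0.0
    for i in range(n):
        row = sims[i]                       # video -> text direction
        loss += -row[i] + math.log(sum(math.exp(s) for s in row))
        col = [sims[j][i] for j in range(n)]  # text -> video direction
        loss += -col[i] + math.log(sum(math.exp(s) for s in col))
    return loss / (2 * n)

def generative_loss(token_probs):
    """Mean negative log-likelihood of the gold next tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def joint_loss(video_embs, text_embs, token_probs, alpha=0.5):
    # alpha is an illustrative mixing weight, not a value from the paper.
    return (alpha * info_nce(video_embs, text_embs)
            + (1 - alpha) * generative_loss(token_probs))
```

With orthogonal embeddings for distinct pairs and confident token predictions, both terms approach zero, e.g. `joint_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]], [0.9, 0.95])` is small; the contrastive term pulls matched video/text pairs together in embedding space while the generative term preserves text-output quality.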