OmniVTG: A Large-Scale Dataset and Training Paradigm for Open-World Video Temporal Grounding

arXiv cs.CV / 4/29/2026

📰 News · Developer Stack & Infrastructure · Models & Research

Key Points

  • The paper introduces OmniVTG, a new large-scale dataset for open-world Video Temporal Grounding (VTG), where text queries must be localized to specific video time segments despite wide semantic diversity.
  • OmniVTG is built with a Semantic Coverage Iterative Expansion pipeline that detects vocabulary gaps in existing datasets and then collects videos likely to contain the missing concepts (see the sketch after this list).
  • For annotation, the authors leverage the finding that multimodal LLMs perform better at dense captioning than at direct grounding, using a caption-centric pipeline to generate dense, timestamped descriptions.
  • The authors argue that supervised fine-tuning alone is not enough to close the common-vs-rare concept performance gap, and propose a Self-Correction Chain-of-Thought training paradigm in which the model refines its own predictions, instilled through a three-stage pipeline of SFT, CoT fine-tuning, and reinforcement learning.
  • Experiments show strong open-world grounding results on OmniVTG and state-of-the-art zero-shot performance on four existing VTG benchmarks; the accompanying code is released on GitHub.
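
The following is a rough, illustrative sketch of the coverage-expansion idea described above, not the authors' implementation: count how often concepts from a broad target taxonomy appear in a dataset's queries, treat under-represented concepts as gaps, and use those gaps to drive video collection. The concept extractor, the `min_count` threshold, and the `search_videos` retrieval step are all placeholder assumptions.

```python
# Illustrative sketch of a semantic-coverage expansion loop (not the paper's code).
from collections import Counter


def extract_concepts(query: str) -> set[str]:
    """Toy concept extractor: lowercased word tokens stand in for whatever
    tagger or noun-phrase extractor builds the dataset vocabulary."""
    return {tok.strip(".,").lower() for tok in query.split() if len(tok) > 3}


def coverage_gaps(queries, target_concepts, min_count=5):
    """Concepts from the target taxonomy seen fewer than `min_count` times
    in the current queries are treated as coverage gaps."""
    counts = Counter()
    for q in queries:
        counts.update(extract_concepts(q) & target_concepts)
    return {c for c in target_concepts if counts[c] < min_count}


def expand_dataset(queries, target_concepts, search_videos, rounds=3):
    """Iteratively detect gaps and collect videos likely to contain them.
    `search_videos(concept)` is a placeholder for the retrieval/crawling step."""
    queries = list(queries)
    collected = []
    for _ in range(rounds):
        gaps = coverage_gaps(queries, target_concepts)
        if not gaps:
            break
        for concept in gaps:
            collected.extend(search_videos(concept))
            # Annotations produced for the new videos would be folded back into
            # `queries` before the next round; the concept name stands in here.
            queries.append(concept)
    return collected
```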

Abstract

Video Temporal Grounding (VTG), the task of localizing video segments from text queries, struggles in open-world settings due to limited dataset scale and semantic diversity, causing performance gaps between common and rare concepts. To overcome these limitations, we introduce OmniVTG, a new large-scale dataset for open-world VTG, coupled with a Self-Correction Chain-of-Thought (CoT) training paradigm designed to enhance the grounding capabilities of Multimodal Large Language Models (MLLMs). OmniVTG is constructed via a novel Semantic Coverage Iterative Expansion pipeline, which first identifies gaps in the vocabulary of existing datasets and then collects videos that are highly likely to contain these target concepts. For high-quality annotation, we leverage the insight that modern MLLMs excel at dense captioning more than at direct grounding, and design a caption-centric data engine that prompts MLLMs to generate dense, timestamped descriptions. Beyond the dataset, we observe that simple supervised fine-tuning (SFT) is insufficient, as a performance gap between rare and common concepts still persists. We find that MLLMs' video understanding ability significantly surpasses their direct grounding ability. Based on this, we propose a Self-Correction CoT training paradigm: we train the MLLM to first predict, then use its understanding capabilities to reflect on and refine its own predictions. This capability is instilled via a three-stage pipeline of SFT, CoT fine-tuning, and reinforcement learning. Extensive experiments show our approach not only excels at open-world grounding on our OmniVTG dataset but also achieves state-of-the-art zero-shot performance on four existing VTG benchmarks. Code is available at https://github.com/oceanflowlab/OmniVTG.
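
The abstract describes the annotation engine only at a high level: an MLLM is prompted for dense, timestamped captions, and those captions become grounding annotations. A minimal caption-then-parse sketch under that description might look like the code below; the prompt wording, the timestamp format, and the `mllm_generate` interface are assumptions made for illustration, not the paper's data engine.

```python
# Hypothetical caption-centric annotation sketch; prompt text, timestamp format,
# and the `mllm_generate` call are illustrative assumptions.
import re

CAPTION_PROMPT = (
    "Watch the video and produce dense captions, one per line, in the form\n"
    "[start_sec - end_sec] description of what happens in that interval."
)

TIMESTAMP_RE = re.compile(r"\[(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)\]\s*(.+)")


def parse_timestamped_captions(text: str):
    """Turn lines like '[12.0 - 18.5] a dog catches a frisbee' into
    (start, end, caption) triples usable as grounding annotations."""
    triples = []
    for line in text.splitlines():
        m = TIMESTAMP_RE.match(line.strip())
        if m:
            start, end = float(m.group(1)), float(m.group(2))
            if end > start:
                triples.append((start, end, m.group(3)))
    return triples


def annotate(video_path, mllm_generate):
    """`mllm_generate(video, prompt)` stands in for any MLLM captioning call;
    each dense caption becomes a (query, segment) pair for VTG training."""
    raw = mllm_generate(video_path, CAPTION_PROMPT)
    return parse_timestamped_captions(raw)
```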
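
The Self-Correction CoT paradigm is summarized as "first predict, then reflect on and refine the prediction", exploiting the observation that the model's video understanding is stronger than its direct grounding. The sketch below captures only that inference-time loop, assuming a generic `model.generate(video, prompt)` interface; the three-stage SFT, CoT fine-tuning, and reinforcement-learning training that instills the behavior is not reproduced here.

```python
# Minimal predict-then-self-correct grounding loop; the `model.generate`
# interface and prompt phrasing are assumptions, not the paper's implementation.

def ground_with_self_correction(model, video, query, max_rounds=2):
    """Ask for a segment, then ask the model to audit and, if needed, correct it."""
    prediction = model.generate(
        video=video,
        prompt=(f"Localize the segment described by: '{query}'. "
                "Answer as 'start_sec, end_sec'."),
    )
    for _ in range(max_rounds):
        verdict = model.generate(
            video=video,
            prompt=(f"The query is: '{query}'. The candidate segment is "
                    f"{prediction}. Does this segment fully match the query? "
                    "Reply 'yes', or give a corrected 'start_sec, end_sec'."),
        )
        if verdict.strip().lower().startswith("yes"):
            break
        prediction = verdict  # adopt the model's own corrected segment
    return prediction
```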