OmniVTG: A Large-Scale Dataset and Training Paradigm for Open-World Video Temporal Grounding

arXiv cs.CV / 4/29/2026

📰 News · Developer Stack & Infrastructure · Models & Research

Key Points

  • The paper introduces OmniVTG, a new large-scale dataset for open-world Video Temporal Grounding (VTG), where text queries must be localized to specific video time segments despite wide semantic diversity.
  • OmniVTG is built with a Semantic Coverage Iterative Expansion pipeline that detects vocabulary gaps in existing datasets and then collects videos likely to contain the missing concepts (see the sketch after this list).
  • For annotation, the authors leverage the finding that multimodal LLMs perform better at dense captioning than at direct grounding, using a caption-centric pipeline to generate dense, timestamped descriptions.
  • The authors argue that supervised fine-tuning alone is not enough to close the common-vs-rare concept performance gap, and propose a Self-Correction Chain-of-Thought training paradigm in which the model refines its own predictions, instilled through a three-stage pipeline of SFT, CoT fine-tuning, and reinforcement learning.
  • Experiments show strong open-world grounding results on OmniVTG and state-of-the-art zero-shot performance on four existing VTG benchmarks; the accompanying code is released on GitHub.
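
The following is a rough, illustrative sketch of the coverage-expansion idea described above, not the authors' implementation: count how often concepts from a broad target taxonomy appear in a dataset's queries, treat under-represented concepts as gaps, and use those gaps to drive video collection. The concept extractor, the `min_count` threshold, and the `search_videos` retrieval step are all placeholder assumptions.

```python
# Illustrative sketch of a semantic-coverage expansion loop (not the paper's code).
from collections import Counter


def extract_concepts(query: str) -> set[str]:
    """Toy concept extractor: lowercased word tokens stand in for whatever
    tagger or noun-phrase extractor builds the dataset vocabulary."""
    return {tok.strip(".,").lower() for tok in query.split() if len(tok) > 3}


def coverage_gaps(queries, target_concepts, min_count=5):
    """Concepts from the target taxonomy seen fewer than `min_count` times
    in the current queries are treated as coverage gaps."""
    counts = Counter()
    for q in queries:
        counts.update(extract_concepts(q) & target_concepts)
    return {c for c in target_concepts if counts[c] < min_count}


def expand_dataset(queries, target_concepts, search_videos, rounds=3):
    """Iteratively detect gaps and collect videos likely to contain them.
    `search_videos(concept)` is a placeholder for the retrieval/crawling step."""
    queries = list(queries)
    collected = []
    for _ in range(rounds):
        gaps = coverage_gaps(queries, target_concepts)
        if not gaps:
            break
        for concept in gaps:
            collected.extend(search_videos(concept))
            # Annotations produced for the new videos would be folded back into
            # `queries` before the next round; the concept name stands in here.
            queries.append(concept)
    return collected
```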

Abstract

Video Temporal Grounding (VTG), the task of localizing video segments from text queries, struggles in open-world settings due to limited dataset scale and semantic diversity, causing performance gaps between common and rare concepts. To overcome these limitations, we introduce OmniVTG, a new large-scale dataset for open-world VTG, coupled with a Self-Correction Chain-of-Thought (CoT) training paradigm designed to enhance the grounding capabilities of Multimodal Large Language Models (MLLMs). OmniVTG is constructed via a novel Semantic Coverage Iterative Expansion pipeline, which first identifies gaps in the vocabulary of existing datasets and then collects videos that are highly likely to contain these target concepts. For high-quality annotation, we leverage the insight that modern MLLMs excel at dense captioning more than at direct grounding, and design a caption-centric data engine that prompts MLLMs to generate dense, timestamped descriptions. Beyond the dataset, we observe that simple supervised fine-tuning (SFT) is insufficient, as a performance gap between rare and common concepts still persists. We find that MLLMs' video understanding ability significantly surpasses their direct grounding ability. Based on this, we propose a Self-Correction CoT training paradigm: we train the MLLM to first predict, then use its understanding capabilities to reflect on and refine its own predictions. This capability is instilled via a three-stage pipeline of SFT, CoT fine-tuning, and reinforcement learning. Extensive experiments show our approach not only excels at open-world grounding on our OmniVTG dataset but also achieves state-of-the-art zero-shot performance on four existing VTG benchmarks. Code is available at https://github.com/oceanflowlab/OmniVTG.
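
The abstract describes the annotation engine only at a high level: an MLLM is prompted for dense, timestamped captions, and those captions become grounding annotations. A minimal caption-then-parse sketch under that description might look like the code below; the prompt wording, the timestamp format, and the `mllm_generate` interface are assumptions made for illustration, not the paper's data engine.

```python
# Hypothetical caption-centric annotation sketch; prompt text, timestamp format,
# and the `mllm_generate` call are illustrative assumptions.
import re

CAPTION_PROMPT = (
    "Watch the video and produce dense captions, one per line, in the form\n"
    "[start_sec - end_sec] description of what happens in that interval."
)

TIMESTAMP_RE = re.compile(r"\[(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)\]\s*(.+)")


def parse_timestamped_captions(text: str):
    """Turn lines like '[12.0 - 18.5] a dog catches a frisbee' into
    (start, end, caption) triples usable as grounding annotations."""
    triples = []
    for line in text.splitlines():
        m = TIMESTAMP_RE.match(line.strip())
        if m:
            start, end = float(m.group(1)), float(m.group(2))
            if end > start:
                triples.append((start, end, m.group(3)))
    return triples


def annotate(video_path, mllm_generate):
    """`mllm_generate(video, prompt)` stands in for any MLLM captioning call;
    each dense caption becomes a (query, segment) pair for VTG training."""
    raw = mllm_generate(video_path, CAPTION_PROMPT)
    return parse_timestamped_captions(raw)
```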
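
The Self-Correction CoT paradigm is summarized as "first predict, then reflect on and refine the prediction", exploiting the observation that the model's video understanding is stronger than its direct grounding. The sketch below captures only that inference-time loop, assuming a generic `model.generate(video, prompt)` interface; the three-stage SFT, CoT fine-tuning, and reinforcement-learning training that instills the behavior is not reproduced here.

```python
# Minimal predict-then-self-correct grounding loop; the `model.generate`
# interface and prompt phrasing are assumptions, not the paper's implementation.

def ground_with_self_correction(model, video, query, max_rounds=2):
    """Ask for a segment, then ask the model to audit and, if needed, correct it."""
    prediction = model.generate(
        video=video,
        prompt=(f"Localize the segment described by: '{query}'. "
                "Answer as 'start_sec, end_sec'."),
    )
    for _ in range(max_rounds):
        verdict = model.generate(
            video=video,
            prompt=(f"The query is: '{query}'. The candidate segment is "
                    f"{prediction}. Does this segment fully match the query? "
                    "Reply 'yes', or give a corrected 'start_sec, end_sec'."),
        )
        if verdict.strip().lower().startswith("yes"):
            break
        prediction = verdict  # adopt the model's own corrected segment
    return prediction
```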