OmniVTG: A Large-Scale Dataset and Training Paradigm for Open-World Video Temporal Grounding
arXiv cs.CV / 4/29/2026
📰 News · Developer Stack & Infrastructure · Models & Research
Key Points
- The paper introduces OmniVTG, a new large-scale dataset for open-world Video Temporal Grounding (VTG), where text queries must be localized to specific video time segments despite wide semantic diversity.
- OmniVTG is built with a Semantic Coverage Iterative Expansion pipeline that detects vocabulary gaps in existing datasets and then collects videos likely to contain the missing concepts (a minimal sketch of the gap-detection step follows these key points).
- For annotation, the authors leverage the finding that multimodal LLMs are more reliable at dense captioning than at direct grounding, using a caption-centric pipeline to generate dense, timestamped descriptions (see the second sketch below).
- The authors argue that simple supervised fine-tuning is not enough to close the performance gap between common and rare concepts, and propose a Self-Correction Chain-of-Thought training paradigm that refines the model's own predictions through a multi-stage pipeline of supervised fine-tuning, chain-of-thought fine-tuning, and reinforcement learning (see the third sketch below).
- Experiments show the method achieves strong open-world grounding results on OmniVTG and sets state-of-the-art zero-shot performance on four existing VTG benchmarks, with accompanying code released on GitHub.
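The gap-detection step of the Semantic Coverage Iterative Expansion pipeline can be pictured roughly as follows. This is a minimal Python sketch under stated assumptions: the function name, threshold, and toy data are illustrative, not the authors' implementation.

```python
# Hedged sketch: find concepts that are absent or under-represented in an
# existing query corpus relative to a broader target vocabulary. The gaps
# would then seed the next round of video collection.
from collections import Counter

def find_vocabulary_gaps(existing_queries, target_vocabulary, min_count=1):
    """Return target concepts mentioned fewer than `min_count` times
    across the existing dataset's text queries."""
    counts = Counter()
    for query in existing_queries:
        text = query.lower()
        for concept in target_vocabulary:
            if concept in text:
                counts[concept] += 1
    return [c for c in target_vocabulary if counts[c] < min_count]

# Illustrative usage with toy data.
existing = ["a person opens a door", "a dog runs across the lawn"]
vocabulary = ["door", "dog", "welding torch", "origami crane"]
print(find_vocabulary_gaps(existing, vocabulary))
# -> ['welding torch', 'origami crane']
```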
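The caption-centric annotation idea, where dense timestamped descriptions double as grounding supervision, might look schematically like this. The record fields and the conversion to (query, span) pairs are assumptions for illustration only, not the dataset's actual schema.

```python
# Hedged sketch: each timestamped dense caption becomes one grounding
# training example of the form (video, query, temporal span).
from dataclasses import dataclass

@dataclass
class TimedCaption:
    start: float   # segment start, in seconds
    end: float     # segment end, in seconds
    text: str      # dense caption describing the segment

def captions_to_grounding_pairs(video_id, captions):
    """Turn timestamped captions into grounding supervision records."""
    return [
        {"video": video_id, "query": c.text, "span": (c.start, c.end)}
        for c in captions
    ]

examples = captions_to_grounding_pairs(
    "vid_0001",
    [TimedCaption(3.2, 9.8, "a chef folds an origami crane from a napkin")],
)
print(examples[0]["span"])  # -> (3.2, 9.8)
```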
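The self-correction loop at the heart of the Self-Correction Chain-of-Thought paradigm could be organized along these lines. The `model` callable, the prompt modes, and the convergence check are hypothetical placeholders, not the paper's interface.

```python
# Hedged sketch: the model first predicts a span with a reasoning trace,
# then is prompted to critique and refine its own answer; refined traces
# would feed the later training stages.
def self_correct(model, video, query, rounds=2):
    """Iteratively refine a predicted (start, end) span for `query`."""
    answer = model(video=video, query=query, mode="initial")
    for _ in range(rounds):
        revised = model(
            video=video,
            query=query,
            mode="refine",
            previous=answer,   # the model sees and critiques its own output
        )
        if revised["span"] == answer["span"]:
            break              # converged: the span no longer changes
        answer = revised
    return answer

# Trivial mock for demonstration only: always returns the same span.
mock = lambda **kw: {"span": (4.0, 11.5), "rationale": "stub"}
print(self_correct(mock, video="vid_0001", query="origami crane")["span"])
# -> (4.0, 11.5)
```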