TransVLM: A Vision-Language Framework and Benchmark for Detecting Any Shot Transitions

arXiv cs.CV / 5/1/2026

Key Points

  • The paper argues that traditional Shot Boundary Detection (SBD) fails on complex transitions because it focuses on isolated cut points, often producing corrupted shot segments.
  • It proposes reformulating the problem as Shot Transition Detection (STD), explicitly detecting the continuous temporal segments over which transitions occur.
  • The authors introduce TransVLM, a vision-language model framework for STD that injects optical flow as a motion prior at the input and fuses the concatenated color and motion features, improving temporal awareness without adding extra visual tokens to the language backbone (a minimal sketch follows this list).
  • To address class imbalance, they build a scalable data engine to synthesize diverse transition videos for training and release a comprehensive STD benchmark.
  • Experiments show that TransVLM outperforms heuristic baselines, specialized spatiotemporal networks, and leading VLMs; the approach has been deployed in production.
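
The fusion mechanism in the third bullet can be made concrete. The paper's code is not reproduced here; the snippet below is a hypothetical PyTorch sketch, assuming a ViT-style patch embedding, of how concatenating a 2-channel optical-flow field with the 3 RGB channels before patchification folds motion into every visual token while leaving the token count unchanged. The class name FusedPatchEmbed and all dimensions are illustrative, not taken from the paper.

```python
# Hypothetical sketch of channel-level color+motion fusion (not the
# paper's actual implementation). Flow is concatenated with RGB before
# patchification, so the visual token count stays the same as RGB-only.
import torch
import torch.nn as nn

class FusedPatchEmbed(nn.Module):
    """Embed RGB (3 ch) + optical flow (2 ch) into ViT-style tokens."""
    def __init__(self, patch=14, dim=1024):
        super().__init__()
        # 5 input channels: 3 color + 2 flow (dx, dy). Same patch size
        # and stride as an RGB-only embed, so no extra tokens appear.
        self.proj = nn.Conv2d(5, dim, kernel_size=patch, stride=patch)

    def forward(self, rgb, flow):
        # rgb: (B, 3, H, W); flow: (B, 2, H, W) from any flow estimator
        x = torch.cat([rgb, flow], dim=1)    # (B, 5, H, W)
        x = self.proj(x)                     # (B, dim, H/p, W/p)
        return x.flatten(2).transpose(1, 2)  # (B, N_tokens, dim)

embed = FusedPatchEmbed()
tokens = embed(torch.randn(1, 3, 224, 224), torch.randn(1, 2, 224, 224))
print(tokens.shape)  # torch.Size([1, 256, 1024]), same N as RGB alone
```

Whatever the paper's exact design, the property this illustrates is that the language backbone sees the usual number of visual tokens; motion is absorbed at the embedding stage rather than appended as extra context.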

Abstract

Traditional Shot Boundary Detection (SBD) inherently struggles with complex transitions because it formulates the task around isolated cut points, frequently yielding corrupted video shots. We address this fundamental limitation by formalizing the Shot Transition Detection (STD) task. Rather than searching for ambiguous points, STD explicitly detects the continuous temporal segments of transitions. To tackle this, we propose TransVLM, a Vision-Language Model (VLM) framework for STD. Unlike standard VLMs, which predominantly rely on spatial semantics and struggle with fine-grained inter-shot dynamics, our method explicitly injects optical flow as a critical motion prior at the input stage. Through a simple yet effective feature-fusion strategy, TransVLM directly processes concatenated color and motion representations, significantly enhancing its temporal awareness without incurring any additional visual-token overhead on the language backbone. To overcome the severe class imbalance in public data, we design a scalable data engine that synthesizes diverse transition videos for robust training, alongside a comprehensive benchmark for STD. Extensive experiments demonstrate that TransVLM achieves superior overall performance, outperforming traditional heuristic methods, specialized spatiotemporal networks, and top-tier VLMs. This work has been deployed in production. For more related research, please visit HeyGen Research (https://www.heygen.com/research) and HeyGen Avatar-V (https://www.heygen.com/research/avatar-v-model). Project page: https://chence17.github.io/TransVLM/
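
The data engine is described only at a high level above, but the core idea of segment-level ground truth is easy to illustrate. The toy NumPy sketch below shows one operation such an engine might perform: splicing two shots with a cross-dissolve and recording the transition as a (start, end) frame range rather than a single cut point. The function name, parameters, and the restriction to a single dissolve type are all assumptions made for illustration.

```python
# Toy sketch of one possible data-engine operation: synthesize a
# cross-dissolve between two shots and record the transition *segment*.
# Illustration only; a real engine would cover many transition types.
import numpy as np

def synthesize_dissolve(shot_a, shot_b, n_trans=12):
    """Join two clips of shape (T, H, W, C) with an n_trans-frame dissolve.

    Returns the composite video and the inclusive (start, end) frame
    indices of the transition segment, i.e. STD-style ground truth.
    """
    alphas = np.linspace(0.0, 1.0, n_trans)[:, None, None, None]
    blend = (1 - alphas) * shot_a[-n_trans:] + alphas * shot_b[:n_trans]
    video = np.concatenate([shot_a[:-n_trans], blend, shot_b[n_trans:]])
    start = len(shot_a) - n_trans
    return video.astype(shot_a.dtype), (start, start + n_trans - 1)

a = np.random.rand(40, 64, 64, 3).astype(np.float32)
b = np.random.rand(40, 64, 64, 3).astype(np.float32)
video, segment = synthesize_dissolve(a, b)
print(video.shape, segment)  # (68, 64, 64, 3) (28, 39)
```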
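This summary does not state how the benchmark scores predictions, but segment detection tasks are commonly evaluated by IoU-matched precision and recall; the sketch below assumes such a criterion purely to show what segment-level (rather than point-level) evaluation looks like. The threshold and the greedy matching are assumptions, not the paper's metric.

```python
# Assumed segment-level scoring for STD (the paper's actual metric is
# not given here): greedy one-to-one matching at an IoU threshold.
def segment_iou(pred, gt):
    """IoU of two inclusive (start, end) frame-index segments."""
    inter = min(pred[1], gt[1]) - max(pred[0], gt[0]) + 1
    union = max(pred[1], gt[1]) - min(pred[0], gt[0]) + 1
    return max(inter, 0) / union

def segment_f1(preds, gts, thresh=0.5):
    """F1 after greedily matching predicted to ground-truth segments."""
    matched, used = 0, set()
    for p in preds:
        for i, g in enumerate(gts):
            if i not in used and segment_iou(p, g) >= thresh:
                matched, used = matched + 1, used | {i}
                break
    prec = matched / len(preds) if preds else 0.0
    rec = matched / len(gts) if gts else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

# One true dissolve at frames 28-39; one correct and one spurious prediction.
print(segment_f1([(28, 39), (70, 75)], [(28, 40)]))  # ≈ 0.667
```

Under a point-based SBD formulation, by contrast, the same dissolve would collapse to a single ambiguous cut frame, which is exactly the failure mode the paper's STD reformulation targets.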