Training-Free Semantic Multi-Object Tracking with Vision-Language Models

arXiv cs.CV / 4/16/2026

Key Points

The paper introduces TF-SMOT, a training-free semantic multi-object tracking pipeline that generates human-interpretable outputs (summaries, instance captions, and interaction labels) rather than only object trajectories.
TF-SMOT is built by composing pretrained components for detection, promptable SAM2 mask-based tracking, and video-language generation using InternVideo2.5.
For interaction semantics, it grounds interaction predicates and maps them to BenSMOT WordNet synsets via gloss-based semantic retrieval, using an LLM for disambiguation.
Experiments on the BenSMOT benchmark show state-of-the-art tracking performance within the SMOT setting and improved summary/caption quality versus prior approaches.
Interaction recognition performance is still limited by strict exact-match evaluation, with ablations suggesting that semantic overlap and WordNet label granularity strongly affect measured results.
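
The gloss-based retrieval step in the key points above can be illustrated with a minimal sketch. The source does not say which similarity model is used, so the bag-of-words cosine here is only a stand-in, and the synset-style ids, glosses, and the `retrieve` helper are hypothetical illustrations rather than real BenSMOT labels or verbatim WordNet glosses. The LLM disambiguation step is omitted; in a pipeline like the one described, the shortlisted candidates would presumably be passed to an LLM for the final choice.

```python
import math
import re
from collections import Counter

# Hypothetical label space: synset-style ids mapped to illustrative glosses.
# These are NOT real BenSMOT labels or verbatim WordNet glosses.
GLOSSES = {
    "hug.v.01": "squeeze someone tightly in your arms",
    "chase.v.01": "go after someone with the intent to catch",
    "talk.v.01": "exchange thoughts by speaking with someone",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(predicate, glosses, top_k=2):
    """Rank labels by similarity between the predicate and lemma + gloss.

    Assumption: bag-of-words cosine stands in for whatever semantic
    similarity model the actual system uses.
    """
    query = Counter(tokenize(predicate))
    scored = []
    for synset_id, gloss in glosses.items():
        lemma = synset_id.split(".")[0]
        doc = Counter(tokenize(f"{lemma} {gloss}"))
        scored.append((cosine(query, doc), synset_id))
    # Highest score first; ties are broken alphabetically by id.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(synset_id, score) for score, synset_id in scored[:top_k]]


candidates = retrieve("speaking with someone", GLOSSES, top_k=2)
```

Returning a shortlist instead of a single label keeps the retrieval cheap and leaves the harder choice between near-synonymous candidates to a separate disambiguation step.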

Abstract

Semantic Multi-Object Tracking (SMOT) extends multi-object tracking with semantic outputs such as video summaries, instance-level captions, and interaction labels, aiming to move from trajectories to human-interpretable descriptions of dynamic scenes. Existing SMOT systems are trained end-to-end, which couples progress to expensive supervision and limits the ability to adapt quickly to new foundation models and new interactions. We propose TF-SMOT, a training-free SMOT pipeline that composes pretrained components for detection, mask-based tracking, and video-language generation. TF-SMOT combines D-FINE and the promptable SAM2 segmentation tracker to produce temporally consistent tracklets, uses contour grounding to generate video summaries and instance captions with InternVideo2.5, and aligns extracted interaction predicates to BenSMOT WordNet synsets via gloss-based semantic retrieval with LLM disambiguation. On BenSMOT, TF-SMOT achieves state-of-the-art tracking performance within the SMOT setting and improves summary and caption quality compared to prior art. Interaction recognition, however, remains challenging under strict exact-match evaluation on the fine-grained and long-tailed WordNet label space; our analysis and ablations indicate that semantic overlap and label granularity substantially affect measured performance.
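
To make the point about exact-match evaluation and label granularity concrete, here is a small sketch that compares strict matching with one possible relaxed criterion. The hypernym table, the label ids, and the `relaxed_match` rule (a prediction counts if it equals the gold label or if one is an ancestor of the other) are assumptions made for illustration; they are not the paper's ablation protocol or the BenSMOT metric.

```python
# Hypothetical parent (hypernym) links between synset-style labels.
# This toy hierarchy is an assumption, not taken from WordNet or BenSMOT.
PARENT = {
    "jog.v.01": "run.v.01",
    "sprint.v.01": "run.v.01",
    "run.v.01": "move.v.01",
}


def ancestors(label, parent):
    """Return every label reachable by following parent links upward."""
    chain = []
    while label in parent:
        label = parent[label]
        chain.append(label)
    return chain


def exact_match(pred, gold):
    """Strict criterion: the predicted label must equal the gold label."""
    return pred == gold


def relaxed_match(pred, gold, parent=PARENT):
    """Illustrative relaxed criterion: equal, or one is an ancestor of the other."""
    return (
        pred == gold
        or gold in ancestors(pred, parent)
        or pred in ancestors(gold, parent)
    )


def accuracy(pairs, match):
    """Fraction of (prediction, gold) pairs accepted by the match function."""
    return sum(match(pred, gold) for pred, gold in pairs) / len(pairs)


# (prediction, gold) pairs for a toy evaluation set.
pairs = [
    ("jog.v.01", "run.v.01"),     # finer-grained than gold
    ("run.v.01", "run.v.01"),     # exact hit
    ("talk.v.01", "run.v.01"),    # unrelated
    ("sprint.v.01", "jog.v.01"),  # siblings, rejected by both criteria
]

strict_score = accuracy(pairs, exact_match)     # 0.25
relaxed_score = accuracy(pairs, relaxed_match)  # 0.5
```

On this toy set the same predictions score 0.25 under exact match and 0.5 under the relaxed rule, which shows how a prediction that is semantically close but at a different granularity is penalized under strict scoring.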