Training-Free Semantic Multi-Object Tracking with Vision-Language Models
arXiv cs.CV / 4/16/2026
Key Points
- The paper introduces TF-SMOT, a training-free semantic multi-object tracking pipeline that generates human-interpretable outputs (summaries, instance captions, and interaction labels) rather than only object trajectories.
- TF-SMOT is built by composing pretrained components for detection, promptable SAM2 mask-based tracking, and video-language generation using InternVideo2.5.
- For interaction semantics, it grounds interaction predicates and maps them to BenSMOT WordNet synsets via gloss-based semantic retrieval, using an LLM for disambiguation.
- Experiments on the BenSMOT benchmark show state-of-the-art tracking performance within the SMOT setting and improved summary/caption quality versus prior approaches.
- Interaction recognition remains limited under the benchmark's strict exact-match evaluation; ablations suggest that semantic overlap and WordNet label granularity strongly affect measured scores.
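The gloss-based retrieval step described above can be sketched roughly as follows. The toy synset inventory, tokenizer, and bag-of-words cosine scoring below are illustrative assumptions, not the paper's implementation, which operates over BenSMOT's WordNet synsets and uses an LLM for final disambiguation.

```python
from collections import Counter
from math import sqrt

# Toy stand-ins for BenSMOT's WordNet synset inventory: (synset_id, gloss).
# A real system would load these from WordNet; these entries are illustrative.
SYNSET_GLOSSES = [
    ("talk.v.02", "exchange thoughts; talk with"),
    ("follow.v.01", "to travel behind, go after, come after"),
    ("carry.v.01", "move while supporting, either in a vehicle or in one's hands"),
]

def bow(text: str) -> Counter:
    """Lowercased bag-of-words vector (a crude stand-in for a text embedder)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_synset(predicate: str) -> str:
    """Map a free-form interaction predicate to the synset whose gloss is
    most similar; an LLM could then disambiguate among the top-k candidates."""
    q = bow(predicate)
    return max(SYNSET_GLOSSES, key=lambda sg: cosine(q, bow(sg[1])))[0]
```

With a stronger embedder (and the full synset inventory) in place of the bag-of-words scorer, this is the shape of a gloss-based retrieval stage: score the predicate against every gloss, then hand the best matches to a disambiguation step.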