COVTrack++: Learning Open-Vocabulary Multi-Object Tracking from Continuous Videos via a Synergistic Paradigm

arXiv cs.CV · March 26, 2026


Key Points

  • The paper introduces COVTrack++, a synergistic open-vocabulary multi-object tracking (OVMOT) framework that jointly improves detection and association via three modules: Multi-Cue Adaptive Fusion (MCF), Multi-Granularity Hierarchical Aggregation (MGA), and Temporal Confidence Propagation (TCP).
  • To address the lack of continuously annotated training data for OVMOT, the authors construct C-TAO, a continuously annotated dataset that increases annotation density by 26× over the original TAO and captures smooth motion dynamics and intermediate object states.
  • Experiments on TAO show state-of-the-art results, including novel TETA of 35.4% (validation) and 30.5% (test), along with improvements of 4.8% on novel AssocA and 5.8% on novel LocA versus prior methods.
  • The approach demonstrates strong zero-shot generalization on BDD100K, indicating it can track novel categories beyond training.
  • The authors state that both the code and dataset will be publicly released, supporting reproducibility and further research on continuous open-vocabulary tracking.
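To make the Multi-Cue Adaptive Fusion idea concrete, here is a minimal sketch of how appearance, motion, and semantic similarity matrices could be combined with learned per-cue reliability weights. The function name, the softmax weighting scheme, and the input shapes are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def adaptive_cue_fusion(app_sim, motion_sim, sem_sim, reliabilities):
    """Fuse track-detection similarity matrices from three cues.

    app_sim, motion_sim, sem_sim: (num_tracks, num_dets) similarity matrices.
    reliabilities: length-3 vector of per-cue reliability scores
    (hypothetical; in practice these would be predicted per frame).
    """
    # Softmax over reliability scores -> normalized fusion weights.
    w = np.exp(reliabilities - np.max(reliabilities))
    w = w / w.sum()
    # Weighted combination of the three cue-specific similarity matrices.
    return w[0] * app_sim + w[1] * motion_sim + w[2] * sem_sim

# Example: with equal reliabilities, fusion reduces to a plain average.
app = np.array([[0.9, 0.1], [0.2, 0.8]])
mot = np.array([[0.8, 0.2], [0.1, 0.9]])
sem = np.array([[0.7, 0.3], [0.3, 0.7]])
fused = adaptive_cue_fusion(app, mot, sem, np.zeros(3))
```

The appeal of an adaptive scheme over fixed weights is that a cue can be down-weighted when it becomes unreliable, e.g. motion cues during erratic camera movement.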

Abstract

Multi-Object Tracking (MOT) has traditionally focused on a few specific categories, restricting its applicability to real-world scenarios involving diverse objects. Open-Vocabulary Multi-Object Tracking (OVMOT) addresses this by enabling tracking of arbitrary categories, including novel objects unseen during training. However, current progress is constrained by two challenges: the lack of continuously annotated video data for training, and the lack of a customized OVMOT framework to synergistically handle detection and association. We address the data bottleneck by constructing C-TAO, the first continuously annotated training set for OVMOT, which increases annotation density by 26x over the original TAO and captures smooth motion dynamics and intermediate object states. For the framework bottleneck, we propose COVTrack++, a synergistic framework that achieves a bidirectional reciprocal mechanism between detection and association through three modules: (1) Multi-Cue Adaptive Fusion (MCF) dynamically balances appearance, motion, and semantic cues for association feature learning; (2) Multi-Granularity Hierarchical Aggregation (MGA) exploits hierarchical spatial relationships in dense detections, where visible child nodes (e.g., object parts) assist occluded parent objects (e.g., whole body) for association feature enhancement; (3) Temporal Confidence Propagation (TCP) recovers flickering detections through high-confidence tracked objects boosting low-confidence candidates across frames, stabilizing trajectories. Extensive experiments on TAO demonstrate state-of-the-art performance, with novel TETA reaching 35.4% and 30.5% on validation and test sets, improving novel AssocA by 4.8% and novel LocA by 5.8% over previous methods, and show strong zero-shot generalization on BDD100K. The code and dataset will be publicly available.
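The Temporal Confidence Propagation idea described in the abstract, where a high-confidence tracked object from earlier frames boosts a spatially matching low-confidence detection, can be sketched as follows. The function name, the IoU-gated blending rule, and the parameter values are assumptions for illustration; the paper's actual mechanism may differ.

```python
def propagate_confidence(prev_track_conf, det_conf, iou,
                         alpha=0.5, iou_thresh=0.5):
    """Boost a low-confidence detection that overlaps a confident track.

    prev_track_conf: confidence of the matched track from previous frames.
    det_conf: raw confidence of the current-frame detection.
    iou: spatial overlap between the track's predicted box and the detection.
    alpha, iou_thresh: hypothetical blending weight and overlap gate.
    """
    # Only propagate when the boxes overlap enough and the track
    # is more confident than the flickering detection.
    if iou >= iou_thresh and prev_track_conf > det_conf:
        return alpha * det_conf + (1 - alpha) * prev_track_conf
    return det_conf
```

Gating on overlap keeps the boost from rescuing unrelated low-confidence detections, so trajectories are stabilized without inflating false positives elsewhere in the frame.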