
QTrack: Query-Driven Reasoning for Multi-modal MOT

arXiv cs.CV / 3/17/2026


Key Points

  • QTrack introduces a query-driven tracking paradigm that localizes and tracks only the target objects specified by natural language queries while maintaining temporal coherence and identity consistency.
  • The authors build RMOT26, a large-scale grounded-query MOT benchmark with sequence-level splits to prevent identity leakage and enable robust generalization evaluation.
  • They propose QTrack, an end-to-end vision-language model that combines multimodal reasoning with tracking-oriented localization.
  • A Temporal Perception-Aware Policy Optimization method with structured rewards is introduced to encourage motion-aware reasoning.
  • Extensive experiments demonstrate the effectiveness of language-guided tracking; code and data are released at https://github.com/gaash-lab/QTrack.
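The query-driven setting summarized above — given a reference frame, a video, and a textual query, return boxes only for the queried target(s) with consistent identities — can be sketched as a minimal interface. This is an illustrative sketch only: all names (`Box`, `FrameResult`, `track_by_query`) are assumptions, and the keyword-matching stub stands in for the paper's actual vision-language model.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Box:
    x1: float
    y1: float
    x2: float
    y2: float
    track_id: int  # identity must stay consistent across frames

@dataclass
class FrameResult:
    frame_index: int
    boxes: List[Box]  # only targets matching the query; may be empty

def track_by_query(reference_frame, frames, query: str) -> List[FrameResult]:
    """Toy stand-in for a query-driven tracker. A real model would ground
    `query` in the video; here each 'frame' is a list of (label, box)
    detections, and we keep only detections whose label appears in the
    query, assigning each kept label a persistent track id."""
    id_map = {}  # label -> persistent track id
    next_id = 0
    results = []
    for i, detections in enumerate(frames):
        kept = []
        for label, (x1, y1, x2, y2) in detections:
            if label in query:  # crude grounding: substring match
                if label not in id_map:
                    id_map[label] = next_id
                    next_id += 1
                kept.append(Box(x1, y1, x2, y2, id_map[label]))
        results.append(FrameResult(i, kept))
    return results
```

The key contract the interface captures is that untargeted objects are dropped entirely (unlike classical MOT, which tracks everything) and that the queried target keeps the same `track_id` across frames.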

Abstract

Multi-object tracking (MOT) has traditionally focused on estimating trajectories of all objects in a video, without selectively reasoning about user-specified targets under semantic instructions. In this work, we introduce a query-driven tracking paradigm that formulates tracking as a spatiotemporal reasoning problem conditioned on natural language queries. Given a reference frame, a video sequence, and a textual query, the goal is to localize and track only the target(s) specified in the query while maintaining temporal coherence and identity consistency. To support this setting, we construct RMOT26, a large-scale benchmark with grounded queries and sequence-level splits to prevent identity leakage and enable robust evaluation of generalization. We further present QTrack, an end-to-end vision-language model that integrates multimodal reasoning with tracking-oriented localization. Additionally, we introduce a Temporal Perception-Aware Policy Optimization strategy with structured rewards to encourage motion-aware reasoning. Extensive experiments demonstrate the effectiveness of our approach for reasoning-centric, language-guided tracking. Code and data are available at https://github.com/gaash-lab/QTrack
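The abstract names a Temporal Perception-Aware Policy Optimization strategy with structured rewards but does not spell the rewards out here. As one plausible illustration only — the function name, the formula, and the weighting are assumptions, not the authors' definition — a motion-aware reward could combine per-frame localization quality (IoU) with a penalty when the predicted box moves less plausibly than the ground truth between frames:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def temporal_reward(pred_boxes, gt_boxes, lam=0.5):
    """Illustrative structured reward (hypothetical, not the paper's):
    mean per-frame IoU, minus a motion term that penalizes the prediction
    for shifting between frames by a different amount than the ground
    truth does."""
    ious = [iou(p, g) for p, g in zip(pred_boxes, gt_boxes)]
    motion_gap = 0.0
    for t in range(1, len(pred_boxes)):
        pred_shift = abs(pred_boxes[t][0] - pred_boxes[t - 1][0])
        gt_shift = abs(gt_boxes[t][0] - gt_boxes[t - 1][0])
        motion_gap += abs(pred_shift - gt_shift)
    motion_gap /= max(1, len(pred_boxes) - 1)
    return sum(ious) / len(ious) - lam * motion_gap
```

A reward of this shape would give a policy-optimization loop gradient pressure toward both accurate localization and ground-truth-consistent motion, which is the stated goal of the paper's motion-aware reasoning.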