Hypergraph-State Collaborative Reasoning for Multi-Object Tracking

arXiv cs.CV / 4/15/2026

Key Points

  • The paper addresses two key weaknesses in existing multi-object tracking motion estimation: instability from noisy/probabilistic predictions and trajectory fragmentation under occlusion.
  • It proposes a collaborative reasoning framework where correlated objects mutually constrain motion states to stabilize estimates and maintain plausible trajectory continuity during occlusion.
  • The method, HyperSSM, combines a Hypergraph module (to model spatial motion correlations via dynamic hyperedges) with a State Space Model (SSM) (to enforce temporal smoothness through structured state transitions).
  • Experiments on MOT17, MOT20, DanceTrack, and SportsMOT show state-of-the-art results across varied motion patterns and scene complexities.
  • Overall, the work presents unified spatial-temporal reasoning that jointly optimizes spatial consensus and temporal coherence for more robust MOT.
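
The spatial half of the idea, correlated objects constraining one another through hyperedges, can be illustrated with a small NumPy sketch. This is not the HyperSSM implementation: the summary does not say how hyperedges are built or how features are aggregated, so the k-nearest-neighbour grouping in velocity space, the mean aggregation, and the names `knn_hyperedges` and `hypergraph_smooth` are all assumptions made for illustration.

```python
import numpy as np

def knn_hyperedges(velocities, k=2):
    """Build an incidence matrix H of shape (n_objects, n_hyperedges).

    Assumption: one hyperedge per object, containing that object and its
    k nearest neighbours in velocity space. Rebuilding H every frame is
    what would make the hyperedges "dynamic".
    """
    n = velocities.shape[0]
    diffs = velocities[:, None, :] - velocities[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    H = np.zeros((n, n))
    for e in range(n):
        members = np.argsort(dists[e])[: k + 1]
        H[members, e] = 1.0
        H[e, e] = 1.0  # every object always belongs to its own hyperedge
    return H

def hypergraph_smooth(features, H):
    """One round of node -> hyperedge -> node mean aggregation."""
    edge_degree = H.sum(axis=0)  # number of objects in each hyperedge
    node_degree = H.sum(axis=1)  # number of hyperedges each object joins
    edge_features = (H.T @ features) / edge_degree[:, None]
    return (H @ edge_features) / node_degree[:, None]

# Two groups of objects moving in opposite directions, with noisy velocities.
velocities = np.array([
    [1.0, 0.0], [1.1, 0.0], [0.9, 0.0],
    [-1.0, 0.0], [-1.1, 0.0], [-0.9, 0.0],
])
H = knn_hyperedges(velocities, k=2)
smoothed = hypergraph_smooth(velocities, H)
```

In this toy case each hyperedge contains exactly one motion group, so every object's velocity is pulled to its group mean while the two groups stay separate. A learned model would replace the fixed mean with trainable weights.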

Abstract

Motion reasoning serves as the cornerstone of multi-object tracking (MOT), as it enables consistent association of targets across frames. However, existing motion estimation approaches face two major limitations: (1) instability caused by noisy or probabilistic predictions, and (2) vulnerability under occlusion, where trajectories often fragment once visual cues disappear. To overcome these issues, we propose a collaborative reasoning framework that enhances motion estimation through joint inference among multiple correlated objects. By allowing objects with similar motion states to mutually constrain and refine each other, our framework stabilizes noisy trajectories and infers plausible motion continuity even when a target is occluded. To realize this concept, we design HyperSSM, an architecture that integrates Hypergraph computation and a State Space Model (SSM) for unified spatial-temporal reasoning. The Hypergraph module captures spatial motion correlations through dynamic hyperedges, while the SSM enforces temporal smoothness via structured state transitions. This synergistic design enables simultaneous optimization of spatial consensus and temporal coherence, resulting in robust and stable motion estimation. Extensive experiments on four mainstream and diverse benchmarks (MOT17, MOT20, DanceTrack, and SportsMOT), covering various motion patterns and scene complexities, demonstrate that our approach achieves state-of-the-art performance across a wide range of tracking scenarios.
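
To make "temporal smoothness via structured state transitions" concrete, the sketch below rolls a track forward with a hand-written linear state-space model. It is a stand-in, not the paper's SSM: the abstract does not give the state layout or the transition parameters, so the constant-velocity matrices, the fixed correction `gain`, and the names `make_constant_velocity_ssm` and `roll_out` are assumptions. The point is only to show how a state transition lets a track continue through frames where the detection is missing.

```python
import numpy as np

def make_constant_velocity_ssm(dt=1.0):
    """Return (A, C) for the state [x, y, vx, vy] and the observation [x, y].

    Assumption: a fixed constant-velocity transition. A learned SSM would
    fit these matrices instead of hard-coding them.
    """
    A = np.array([
        [1.0, 0.0, dt, 0.0],
        [0.0, 1.0, 0.0, dt],
        [0.0, 0.0, 1.0, 0.0],
        [0.0, 0.0, 0.0, 1.0],
    ])
    C = np.array([
        [1.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
    ])
    return A, C

def roll_out(state, observations, A, C, gain=0.5):
    """Advance the state once per frame; `None` marks an occluded frame."""
    positions = []
    for obs in observations:
        state = A @ state  # structured state transition
        if obs is not None:
            residual = np.asarray(obs, dtype=float) - C @ state
            state = state + gain * (C.T @ residual)  # nudge position toward the detection
        positions.append(C @ state)
    return np.array(positions)

A, C = make_constant_velocity_ssm()
initial_state = np.array([0.0, 0.0, 1.0, 0.0])  # at the origin, moving along +x
observations = [(1.0, 0.0), (2.0, 0.0), None, None, (5.0, 0.0)]
track = roll_out(initial_state, observations, A, C)
```

During the two occluded frames the position keeps advancing by the stored velocity, so the track reaches the re-detected target without a gap.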