Multi-view Crowd Tracking Transformer with View-Ground Interactions Under Large Real-World Scenes

arXiv cs.CV / 4/22/2026


Key Points

  • The paper introduces MVTrackTrans, a Transformer-based multi-view crowd tracking model that improves tracking by modeling interactions between camera views and the ground plane.
  • Prior CNN-based multi-view crowd tracking approaches are limited by evaluation on small datasets (e.g., Wildtrack, MultiviewX), which makes it hard to apply them to real-world scenarios with larger spaces and heavy occlusion.
  • To address this gap, the authors collect and annotate two new large-scale real-world multi-view tracking datasets, MVCrowdTrack and CityTrack, spanning larger scene sizes and longer time periods.
  • Experiments on the new large datasets show MVTrackTrans delivers better performance than existing methods, indicating the approach is well-suited for complex, large real-world scenes.
  • The datasets and code are released publicly via the linked GitHub repository to support further research and more practical applications of the task.
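The view-ground interaction described above depends on relating evidence in each camera view to a shared ground plane. In the paper this is learned via Transformer attention, but the underlying geometric idea is commonly a homography mapping image points to ground coordinates. The sketch below is a minimal, hypothetical illustration of that projection step (the matrix values and point coordinates are invented for demonstration, not from the paper):

```python
import numpy as np

def project_to_ground(points_px, H):
    """Project 2D image points (N, 2) to ground-plane coordinates
    using a 3x3 image-to-ground homography H."""
    pts = np.hstack([points_px, np.ones((len(points_px), 1))])  # homogeneous coords
    ground = (H @ pts.T).T                                      # apply homography
    return ground[:, :2] / ground[:, 2:3]                       # dehomogenize

# Hypothetical homography: a simple scale-and-translate for illustration only.
H = np.array([[0.05, 0.00, -10.0],
              [0.00, 0.05,  -5.0],
              [0.00, 0.00,   1.0]])

# Foot points of two detected people, in pixel coordinates (made up).
feet_px = np.array([[320.0, 480.0],
                    [640.0, 480.0]])

print(project_to_ground(feet_px, H))  # ground-plane (x, y) per person
```

In a real multi-view tracker, each camera would have its own calibrated homography (or full projection matrix), and the per-view projections would be fused on the ground plane before association across time.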

Abstract

Multi-view crowd tracking estimates each person's trajectory on the ground plane of the scene. Recent works mainly rely on CNN-based multi-view crowd tracking architectures, and most are evaluated and compared on relatively small datasets such as Wildtrack and MultiviewX. Since these two datasets are collected in small scenes and contain only tens of evaluation frames, current methods are difficult to apply to real-world settings with larger scenes and more complicated occlusion. In this paper, we propose a Transformer-based multi-view crowd tracking model, *MVTrackTrans*, which adopts interactions between camera views and the ground plane to enhance multi-view tracking performance. In addition, for better evaluation, we collect and label two large real-world multi-view tracking datasets, MVCrowdTrack and CityTrack, which cover much larger scenes over longer time periods. Compared with existing methods on these two new large datasets, the proposed MVTrackTrans model achieves better performance, demonstrating the advantages of its design in handling large scenes. We believe the proposed datasets and model will push the task toward more practical scenarios; the datasets and code are available at: https://github.com/zqyq/MVTrackTrans.