A Dataset and Evaluation for Complex 4D Markerless Human Motion Capture

arXiv cs.CV / April 15, 2026


Key Points

  • The paper introduces a new dataset and evaluation benchmark for complex 4D markerless human motion capture, designed to better reflect real-world challenges like multi-person interactions and heavy occlusions.
  • The dataset includes synchronized multi-view RGB and depth sequences with accurate camera calibration, ground-truth 3D motion from a Vicon system, and corresponding SMPL/SMPL-X parameters for tightly aligned supervision.
  • It covers both single- and multi-person scenarios featuring intricate motions, rapid position exchanges between similarly dressed subjects, varying subject distances, and frequent inter-person occlusions.
  • Benchmark results show that current state-of-the-art markerless 4D MoCap models experience substantial performance degradation when tested under these realistic conditions, revealing a persistent domain gap.
  • The authors report that targeted fine-tuning can improve generalization, suggesting the dataset is effective for driving more robust model development.
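The paper does not spell out its evaluation protocol in this summary, but markerless MoCap benchmarks of this kind are conventionally scored with Mean Per-Joint Position Error (MPJPE), the average Euclidean distance between predicted and ground-truth 3D joints, often with a root-aligned variant that discounts global translation. A minimal sketch of both metrics (the array shapes and the pelvis-as-root convention are illustrative assumptions, not details from the paper):

```python
import numpy as np

def mpjpe(pred, gt):
    # pred, gt: (frames, joints, 3) arrays of 3D joint positions.
    # Per-joint Euclidean error, averaged over all joints and frames.
    return np.linalg.norm(pred - gt, axis=-1).mean()

def mpjpe_root_aligned(pred, gt, root=0):
    # Subtract each skeleton's root joint (commonly the pelvis) before
    # measuring error, removing global translation from the comparison.
    return mpjpe(pred - pred[:, root:root + 1], gt - gt[:, root:root + 1])

# Toy usage: a uniform 5 cm offset in every coordinate.
gt = np.zeros((2, 17, 3))
pred = gt + 0.05
print(mpjpe(pred, gt))          # sqrt(3) * 0.05 ≈ 0.0866 (same units as input)
print(mpjpe_root_aligned(pred, gt))  # 0.0 — the offset is pure translation
```

Reporting both numbers is useful for the occlusion-heavy, multi-person sequences this dataset targets: a large gap between the two indicates errors dominated by global localization rather than body pose.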

Abstract

Marker-based motion capture (MoCap) systems have long been the gold standard for accurate 4D human modeling, yet their reliance on specialized hardware and markers limits scalability and real-world deployment. Advancing reliable markerless 4D human motion capture requires datasets that reflect the complexity of real-world human interactions. Yet, existing benchmarks often lack realistic multi-person dynamics, severe occlusions, and challenging interaction patterns, leading to a persistent domain gap. In this work, we present a new dataset and evaluation for complex 4D markerless human motion capture. Our proposed MoCap dataset captures both single and multi-person scenarios with intricate motions, frequent inter-person occlusions, rapid position exchanges between similarly dressed subjects, and varying subject distances. It includes synchronized multi-view RGB and depth sequences, accurate camera calibration, ground-truth 3D motion capture from a Vicon system, and corresponding SMPL/SMPL-X parameters. This setup ensures precise alignment between visual observations and motion ground truth. Benchmarking state-of-the-art markerless MoCap models reveals substantial performance degradation under these realistic conditions, highlighting limitations of current approaches. We further demonstrate that targeted fine-tuning improves generalization, validating the dataset's realism and value for model development. Our evaluation exposes critical gaps in existing models and provides a rigorous foundation for advancing robust markerless 4D human motion capture.