TT4D: A Pipeline and Dataset for Table Tennis 4D Reconstruction From Monocular Videos

arXiv cs.CV / 5/5/2026


Key Points

  • The paper introduces TT4D, a large-scale, high-fidelity dataset with 140+ hours of table tennis singles and doubles reconstructions derived from monocular broadcast videos.
  • TT4D includes multimodal annotations such as accurate camera calibrations, precise 3D ball positions and spin, time segmentation, and time-varying 3D human meshes.
  • The authors propose a new “lift-first” reconstruction pipeline that lifts the entire unsegmented 2D ball track into 3D using a learned network before performing time segmentation.
  • This inversion avoids failures of 2D-based time segmentation caused by occlusion and changing camera viewpoints, enabling reliable reconstruction even under severe occlusion.
  • The dataset's fidelity is validated through two downstream tasks: estimating racket pose and velocity at impact, and training a generative model of competitive rallies.
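The lift-first ordering described in the key points can be sketched as a minimal pipeline. Everything here is illustrative, not the authors' implementation: the lifting network is treated as an opaque callable, and time segmentation is stood in for by a simple bounce-detection heuristic (local height minima near the table surface).

```python
import numpy as np

def lift_track_to_3d(track_2d, lifting_net):
    """Lift the full, unsegmented 2D ball track to 3D with a learned
    network (hypothetical interface; the paper's network also infers spin
    and handles unreliable detections)."""
    return lifting_net(track_2d)  # (T, 3) array of 3D ball positions

def segment_rally(track_3d, table_height=0.76):
    """Split the 3D trajectory at table bounces: local minima of the
    ball height near the table surface (stand-in heuristic, not the
    paper's segmentation method)."""
    z = track_3d[:, 2]
    bounces = [t for t in range(1, len(z) - 1)
               if z[t] < z[t - 1] and z[t] < z[t + 1]
               and abs(z[t] - table_height) < 0.05]
    cuts = [0] + bounces + [len(z)]
    return [track_3d[a:b] for a, b in zip(cuts[:-1], cuts[1:])]

def reconstruct(track_2d, lifting_net):
    # Lift first, then segment -- the inverse of the prior
    # segment-then-reconstruct ordering criticized in the paper.
    track_3d = lift_track_to_3d(track_2d, lifting_net)
    return segment_rally(track_3d)
```

The point of the inversion is that the segmentation step operates on metric 3D heights, which remain well defined under occlusion and viewpoint changes, rather than on the appearance of the 2D track.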

Abstract

We present TT4D, a large-scale, high-fidelity table tennis dataset. It provides 140+ hours of reconstructed singles and doubles gameplay from monocular broadcast videos, featuring multimodal annotations like high-quality camera calibrations, precise 3D ball positions, ball spin, time segmentation, and 3D human meshes over time. This rich data provides a new foundation for virtual replay, in-depth player analysis, and robot learning. The dataset's combination of scale and precision is achieved through a novel reconstruction pipeline. Prior methods first partition a game sequence into individual shot segments based on the 2D ball track, and only then attempt reconstruction. However, 2D-based time segmentation collapses under occlusion and varied camera viewpoints, preventing reliable reconstruction. We invert this paradigm by first lifting the entire unsegmented 2D ball track to 3D through a learned lifting network. This 3D trajectory then allows us to reliably perform time segmentation. The learned lifting network also infers the ball's spin, handles unreliable ball detections, and successfully reconstructs the ball trajectory in cases of high occlusion. This lift-first design is necessary, as our pipeline is the only method capable of reconstructing table tennis gameplay from general-view broadcast monocular videos. We demonstrate the dataset's fidelity through two downstream tasks: estimating the racket's pose & velocity at impact, and training a generative model of competitive rallies.
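The first downstream task in the abstract, estimating the racket's velocity at impact, can be illustrated with a generic finite-difference sketch over a tracked racket-position sequence. This is an assumption-laden stand-in, not the paper's estimator; the function name and interface are hypothetical.

```python
import numpy as np

def racket_velocity_at_impact(positions, times, impact_idx):
    """Estimate racket velocity (m/s) at the impact frame by a central
    finite difference over tracked 3D racket positions.

    positions : (T, 3) array of racket center positions
    times     : (T,) array of frame timestamps in seconds
    impact_idx: index of the frame at which the ball is struck
    """
    i = impact_idx
    dt = times[i + 1] - times[i - 1]
    return (positions[i + 1] - positions[i - 1]) / dt
```

For a racket moving at constant velocity the central difference recovers that velocity exactly; in practice the tracked positions would come from the dataset's time-varying 3D human meshes and ball annotations.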