AI Navigate

Dense Dynamic Scene Reconstruction and Camera Pose Estimation from Multi-View Videos

arXiv cs.CV / March 13, 2026

📰 News · Models & Research

Key Points

  • The paper addresses dense dynamic scene reconstruction and camera pose estimation from multiple freely moving cameras by proposing a two-stage optimization framework that decouples robust camera tracking from dense depth refinement.
  • In stage one, it extends single-camera visual SLAM to multi-camera setups by building a spatiotemporal connection graph that leverages intra-camera temporal continuity and inter-camera spatial overlap; a wide-baseline initialization strategy based on feed-forward reconstruction models keeps tracking robust when overlap is limited (see the graph-construction sketch after this list).
  • In stage two, depth and camera poses are refined by enforcing dense inter- and intra-camera consistency through wide-baseline optical flow (a consistency-residual sketch follows the abstract).
  • The work introduces MultiCamRobolab, a real-world dataset with ground-truth poses from a motion capture system.
  • Experiments show the method significantly outperforms state-of-the-art feed-forward models on synthetic and real-world benchmarks while using less memory.
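
The first-stage connection graph can be pictured with a short sketch. The Python below is illustrative only, not the authors' implementation: it assumes keyframes are keyed by `(camera_id, frame_idx)`, and the `view_overlap` co-visibility score and `overlap_thresh` cutoff are hypothetical stand-ins for whatever criterion the paper actually uses to decide inter-camera edges.

```python
# Minimal sketch of a spatiotemporal connection graph (not the authors' code).
# Nodes are (camera_id, frame_idx) keyframes; edges are either intra-camera
# temporal links between consecutive frames or inter-camera spatial links
# between frames whose views overlap.

from itertools import combinations
from typing import Callable, Dict, List, Tuple

Node = Tuple[int, int]  # (camera_id, frame_idx)


def build_connection_graph(
    frames: Dict[int, List[int]],                 # camera_id -> ordered frame indices
    view_overlap: Callable[[Node, Node], float],  # hypothetical co-visibility score in [0, 1]
    overlap_thresh: float = 0.3,                  # assumed threshold for spatial edges
) -> Dict[str, List[Tuple[Node, Node]]]:
    """Return temporal (intra-camera) and spatial (inter-camera) edges."""
    temporal_edges: List[Tuple[Node, Node]] = []
    spatial_edges: List[Tuple[Node, Node]] = []

    # Intra-camera temporal continuity: link consecutive keyframes of each camera.
    for cam, idxs in frames.items():
        for a, b in zip(idxs, idxs[1:]):
            temporal_edges.append(((cam, a), (cam, b)))

    # Inter-camera spatial overlap: link frames from different cameras whose
    # fields of view overlap enough to constrain relative pose and scale.
    all_nodes = [(cam, i) for cam, idxs in frames.items() for i in idxs]
    for u, v in combinations(all_nodes, 2):
        if u[0] != v[0] and view_overlap(u, v) >= overlap_thresh:
            spatial_edges.append((u, v))

    return {"temporal": temporal_edges, "spatial": spatial_edges}
```

Decoupling the two edge types this way mirrors the paper's stated idea: temporal edges keep each camera's track continuous, while spatial edges tie the cameras into a single, consistently scaled reconstruction.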

Abstract

We address the challenging problem of dense dynamic scene reconstruction and camera pose estimation from multiple freely moving cameras -- a setting that arises naturally when multiple observers capture a shared event. Prior approaches either handle only single-camera input or require rigidly mounted, pre-calibrated camera rigs, limiting their practical applicability. We propose a two-stage optimization framework that decouples the task into robust camera tracking and dense depth refinement. In the first stage, we extend single-camera visual SLAM to the multi-camera setting by constructing a spatiotemporal connection graph that exploits both intra-camera temporal continuity and inter-camera spatial overlap, enabling consistent scale and robust tracking. To ensure robustness under limited overlap, we introduce a wide-baseline initialization strategy using feed-forward reconstruction models. In the second stage, we refine depth and camera poses by optimizing dense inter- and intra-camera consistency using wide-baseline optical flow. Additionally, we introduce MultiCamRobolab, a new real-world dataset with ground-truth poses from a motion capture system. Finally, we demonstrate that our method significantly outperforms state-of-the-art feed-forward models on both synthetic and real-world benchmarks, while requiring less memory.
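To make the second stage more concrete, here is a minimal sketch of the kind of consistency residual the abstract describes, assuming a pinhole camera model and precomputed wide-baseline flow correspondences; the function and argument names (`flow_ij`, `T_ji`, etc.) are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of a stage-two consistency residual (not the authors' code).
# For a pixel in view i with an estimated depth, back-project to 3D, transform
# into view j, and project; the residual is the gap between this reprojection
# and the correspondence predicted by wide-baseline optical flow.

import numpy as np


def reprojection_flow_residual(
    uv_i: np.ndarray,      # (N, 2) pixel coordinates in view i
    depth_i: np.ndarray,   # (N,)   estimated depths for those pixels
    flow_ij: np.ndarray,   # (N, 2) flow-predicted coordinates in view j (assumed given)
    K_i: np.ndarray,       # (3, 3) intrinsics of view i
    K_j: np.ndarray,       # (3, 3) intrinsics of view j
    T_ji: np.ndarray,      # (4, 4) pose mapping view-i coordinates to view-j coordinates
) -> np.ndarray:
    """Per-pixel 2D residual between geometric reprojection and optical flow."""
    # Back-project pixels of view i to 3D points in view i's camera frame.
    ones = np.ones((uv_i.shape[0], 1))
    rays = (np.linalg.inv(K_i) @ np.hstack([uv_i, ones]).T).T  # (N, 3)
    pts_i = rays * depth_i[:, None]

    # Transform into view j and project with a pinhole model.
    pts_j = (T_ji[:3, :3] @ pts_i.T).T + T_ji[:3, 3]
    proj = (K_j @ pts_j.T).T
    uv_j = proj[:, :2] / proj[:, 2:3]

    # Dense consistency: the reprojection should agree with the flow correspondence.
    return uv_j - flow_ij
```

In an optimization loop, a residual of this form would be minimized jointly over depths and camera poses across both intra-camera and inter-camera frame pairs, which is the dense refinement the second stage performs.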