Cross-Vehicle 3D Geometric Consistency for Self-Supervised Surround Depth Estimation on Articulated Vehicles

arXiv cs.AI / 4/6/2026


Key Points

  • The paper proposes ArticuSurDepth, a self-supervised multi-camera framework for surround-view depth estimation, specifically targeting articulated vehicles, which are poorly served by existing passenger-vehicle-centric methods.
  • It improves depth learning by enforcing cross-view geometric consistency via multi-view spatial context enrichment, a cross-view surface normal constraint, and cross-vehicle pose consistency to handle coupled motions across articulated segments.
  • To encourage metric depth, the method adds camera height regularization grounded in ground-plane awareness, aiming to better align predicted depth scales with real-world geometry.
  • The authors validate the approach on a newly built articulated-vehicle experiment platform with a self-collected dataset, and report state-of-the-art performance on both their dataset and established benchmarks including DDAD, nuScenes, and KITTI.
  • The framework is guided by structural priors derived from a vision foundation model to enhance structural coherence across spatial and temporal contexts.
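To make the camera height regularization idea concrete, the sketch below back-projects pixels labeled as ground into 3D using predicted depth, fits a plane by least squares, and penalizes the deviation of the resulting camera-to-ground distance from the known mounting height. This is a minimal illustration under assumed interfaces (the function name, the boolean `ground_mask` input, and the squared-error penalty are all hypothetical), not the paper's implementation:

```python
import numpy as np

def camera_height_regularizer(depth, K, ground_mask, nominal_height):
    """Hypothetical sketch of ground-plane-aware camera height regularization.

    depth:          H x W predicted depth map
    K:              3 x 3 camera intrinsics
    ground_mask:    H x W boolean mask of pixels assumed to lie on the ground
    nominal_height: known camera mounting height above the ground (meters)
    """
    H, W = depth.shape
    us, vs = np.meshgrid(np.arange(W), np.arange(H))
    # Back-project each ground pixel: X = d * K^{-1} [u, v, 1]^T
    pix = np.stack([us[ground_mask], vs[ground_mask],
                    np.ones(ground_mask.sum())], axis=0)      # 3 x N
    pts = ((np.linalg.inv(K) @ pix) * depth[ground_mask]).T   # N x 3 points
    # Fit the plane n·x + d = 0 (||n|| = 1) via SVD of the centered points
    centroid = pts.mean(axis=0)
    _, _, vt = np.linalg.svd(pts - centroid)
    n = vt[-1]                      # smallest singular vector = plane normal
    d = -n @ centroid
    # The camera center is the origin, so its distance to the plane is |d|
    est_height = abs(d)
    return (est_height - nominal_height) ** 2, est_height
```

In a self-supervised setup, a loss term like this gives the network a metric anchor: predicted depths that imply the wrong camera height are penalized, which helps align the learned depth scale with real-world geometry.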

Abstract

Surround depth estimation provides a cost-effective alternative to LiDAR for 3D perception in autonomous driving. While recent self-supervised methods explore multi-camera settings to improve scale awareness and scene coverage, they are primarily designed for passenger vehicles and rarely consider articulated vehicles or robotic platforms. The articulated structure introduces complex cross-segment geometry and motion coupling, making consistent depth reasoning across views more challenging. In this work, we propose **ArticuSurDepth**, a self-supervised framework for surround-view depth estimation on articulated vehicles that enhances depth learning through cross-view and cross-vehicle geometric consistency, guided by structural priors from a vision foundation model. Specifically, we introduce a multi-view spatial context enrichment strategy and a cross-view surface normal constraint to improve structural coherence across spatial and temporal contexts. We further incorporate camera height regularization with ground-plane awareness to encourage metric depth estimation, together with cross-vehicle pose consistency that bridges motion estimation between articulated segments. To validate the proposed method, we built an articulated-vehicle experiment platform and collected a dataset with it. Experimental results demonstrate state-of-the-art (SoTA) depth estimation performance on our self-collected dataset as well as on the DDAD, nuScenes, and KITTI benchmarks.