ZeD-MAP: Bundle Adjustment Guided Zero-Shot Depth Maps for Real-Time Aerial Imaging

arXiv cs.RO / 4/7/2026


Key Points

  • ZeD-MAP addresses real-time depth reconstruction for ultra-high-resolution UAV imagery by combining zero-shot diffusion depth predictions with a bundle-adjustment (BA) guided mapping pipeline.
  • The method groups streamed frames into overlapping clusters and runs incremental, cluster-level BA to produce metrically consistent poses and sparse 3D tie-points.
  • BA-derived tie-points are reprojected into selected frames to provide metric guidance for diffusion-based depth estimation, improving temporal/metric consistency compared with diffusion-only probabilistic inference.
  • Experiments on ground-marker UAV flights using the DLR MACS system show sub-meter accuracy (≈0.87 m XY error and ≈0.12 m Z error) while keeping per-image runtimes in the ~1.47–4.91 s range.
  • The authors argue that BA-based metric guidance yields consistency comparable to classical photogrammetry with significantly faster processing, enabling real-time 3D map generation. They also note that the reported results carry minor noise from manual point-cloud annotation.
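
The cluster grouping described in the key points can be illustrated with a small sliding-window generator. This is only a sketch: the function name `overlapping_clusters` and the `size` and `overlap` parameters are hypothetical, and the summary does not state the cluster size or overlap the authors use.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def overlapping_clusters(
    frames: Iterable[T], size: int = 6, overlap: int = 2
) -> Iterator[List[T]]:
    """Yield clusters of `size` consecutive frames from a stream.

    Consecutive clusters share `overlap` frames. Both values are
    illustrative assumptions, not settings taken from the paper.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")

    buffer: List[T] = []
    fresh = 0  # frames received since the last emitted cluster
    for frame in frames:
        buffer.append(frame)
        fresh += 1
        if len(buffer) == size:
            yield list(buffer)
            # Keep the trailing frames so the next cluster overlaps this one.
            buffer = buffer[size - overlap:]
            fresh = 0

    # Emit a final, shorter cluster if any frame has not been emitted yet.
    if fresh:
        yield list(buffer)


if __name__ == "__main__":
    for cluster in overlapping_clusters(range(10), size=4, overlap=1):
        print(cluster)
```

Because the generator consumes frames one at a time, each cluster is available as soon as its last frame arrives, which is the property an incremental, cluster-level BA step would rely on.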

Abstract

Real-time depth reconstruction from ultra-high-resolution UAV imagery is essential for time-critical geospatial tasks such as disaster response, yet remains challenging due to wide-baseline parallax, large image sizes, low-texture or specular surfaces, occlusions, and strict computational constraints. Recent zero-shot diffusion models offer fast per-image dense predictions without task-specific retraining, and require fewer labelled datasets than transformer-based predictors while avoiding the rigid capture geometry requirement of classical multi-view stereo. However, their probabilistic inference prevents reliable metric accuracy and temporal consistency across sequential frames and overlapping tiles. We present ZeD-MAP, a cluster-level framework that converts a test-time diffusion depth model into a metrically consistent, SLAM-like mapping pipeline by integrating incremental cluster-based bundle adjustment (BA). Streamed UAV frames are grouped into overlapping clusters; periodic BA produces metrically consistent poses and sparse 3D tie-points, which are reprojected into selected frames and used as metric guidance for diffusion-based depth estimation. Validation on ground-marker flights captured at approximately 50 m altitude (GSD is approximately 0.85 cm/px, corresponding to 2,650 square meters ground coverage per frame) with the DLR Modular Aerial Camera System (MACS) shows that our method achieves sub-meter accuracy, with approximately 0.87 m error in the horizontal (XY) plane and 0.12 m in the vertical (Z) direction, while maintaining per-image runtimes between 1.47 and 4.91 seconds. Results are subject to minor noise from manual point-cloud annotation. These findings show that BA-based metric guidance provides consistency comparable to classical photogrammetric methods while significantly accelerating processing, enabling real-time 3D map generation.
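
The abstract says that BA tie-points are reprojected into selected frames and used as metric guidance, but it does not say how the guidance is applied. As a hedged illustration, the sketch below projects sparse 3D points into a frame with a distortion-free pinhole model and then fits a global scale and shift that maps a relative depth map onto the metric depths at those pixels. The scale-and-shift fit is a common alignment technique swapped in for illustration and may differ from the authors' mechanism; the function names `project_tie_points` and `fit_scale_shift` and all parameters are hypothetical.

```python
import numpy as np


def project_tie_points(points_w, R, t, K, width, height):
    """Project world points into a frame; return integer pixels and depths.

    Assumes a distortion-free pinhole camera with x_cam = R @ x_world + t.
    """
    pts_cam = points_w @ R.T + t              # (N, 3) camera-frame coordinates
    z = pts_cam[:, 2]
    in_front = z > 0                          # discard points behind the camera
    pts_cam, z = pts_cam[in_front], z[in_front]

    uv = pts_cam @ K.T                        # homogeneous pixel coordinates
    uv = uv[:, :2] / uv[:, 2:3]
    px = np.round(uv).astype(int)             # nearest pixel (u, v)

    inside = (
        (px[:, 0] >= 0) & (px[:, 0] < width)
        & (px[:, 1] >= 0) & (px[:, 1] < height)
    )
    return px[inside], z[inside]


def fit_scale_shift(relative_depth, px, metric_z):
    """Least-squares s, b such that s * d_rel + b approximates metric depth.

    Illustrative global alignment only; not claimed to be the paper's method.
    """
    d = relative_depth[px[:, 1], px[:, 0]]    # sample depth at (row=v, col=u)
    A = np.stack([d, np.ones_like(d)], axis=1)
    (s, b), *_ = np.linalg.lstsq(A, metric_z, rcond=None)
    return float(s), float(b)


if __name__ == "__main__":
    # Synthetic check: a known metric depth map and a rescaled "relative" copy.
    K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 24.0], [0.0, 0.0, 1.0]])
    uu, _ = np.meshgrid(np.arange(64), np.arange(48))
    metric = 50.0 + 0.1 * uu
    relative = (metric - 2.0) / 25.0

    u = np.array([5, 20, 40, 60])
    v = np.array([10, 30, 12, 40])
    z = metric[v, u]
    points = np.stack([(u - 32) / 100 * z, (v - 24) / 100 * z, z], axis=1)

    px, mz = project_tie_points(points, np.eye(3), np.zeros(3), K, 64, 48)
    s, b = fit_scale_shift(relative, px, mz)
    print(s, b)
```

A single global scale and shift is the simplest possible form of guidance. It cannot correct spatially varying errors, so a pipeline that needs consistency across overlapping tiles would likely require something richer.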