AI Navigate

$D^3$-RSMDE: 40$\times$ Faster and High-Fidelity Remote Sensing Monocular Depth Estimation

arXiv cs.CV / 3/18/2026

📰 NewsIdeas & Deep AnalysisModels & Research

Key Points

  • The paper introduces D^3-RSMDE, a framework that combines a ViT-based fast depth map initialization with a Progressive Linear Blending Refinement (PLBR) stage using a lightweight U-Net to refine details.
  • The refinement operates in a compact latent space under a VAE, replacing the time-consuming initial structure generation of diffusion models.
  • It reports an 11.85% reduction in the LPIPS perceptual metric over leading models such as Marigold and more than 40x speedup in inference while keeping VRAM comparable to lightweight ViT models.
  • The approach enables real-time, high-fidelity monocular depth estimation from remote sensing imagery, broadening practical applications.

Abstract

Real-time, high-fidelity monocular depth estimation from remote sensing imagery is crucial for numerous applications, yet existing methods face a stark trade-off between accuracy and efficiency. Although using Vision Transformer (ViT) backbones for dense prediction is fast, they often exhibit poor perceptual quality. Conversely, diffusion models offer high fidelity but at a prohibitive computational cost. To overcome these limitations, we propose Depth Detail Diffusion for Remote Sensing Monocular Depth Estimation (D^3-RSMDE), an efficient framework designed to achieve an optimal balance between speed and quality. Our framework first leverages a ViT-based module to rapidly generate a high-quality preliminary depth map construction, which serves as a structural prior, effectively replacing the time-consuming initial structure generation stage of diffusion models. Based on this prior, we propose a Progressive Linear Blending Refinement (PLBR) strategy, which uses a lightweight U-Net to refine the details in only a few iterations. The entire refinement step operates efficiently in a compact latent space supported by a Variational Autoencoder (VAE). Extensive experiments demonstrate that D^3-RSMDE achieves a notable 11.85% reduction in the Learned Perceptual Image Patch Similarity (LPIPS) perceptual metric over leading models like Marigold, while also achieving over a 40x speedup in inference and maintaining VRAM usage comparable to lightweight ViT models.