AI Navigate

PureCLIP-Depth: Prompt-Free and Decoder-Free Monocular Depth Estimation within CLIP Embedding Space

arXiv cs.CV / 3/18/2026


Key Points

  • PureCLIP-Depth is a new monocular depth estimation model that is completely prompt-free and decoder-free, operating entirely within the CLIP embedding space.
  • The method learns a direct RGB-to-depth mapping strictly inside the CLIP space, relying on conceptual information rather than traditional geometric features.
  • It achieves state-of-the-art performance among CLIP embedding-based MDE models on both indoor and outdoor datasets.
  • The authors have released the code on GitHub to enable reproducibility and further exploration.

Abstract

We propose PureCLIP-Depth, a completely prompt-free, decoder-free Monocular Depth Estimation (MDE) model that operates entirely within the Contrastive Language-Image Pre-training (CLIP) embedding space. Unlike recent models that rely heavily on geometric features, we explore a novel approach to MDE driven by conceptual information, performing computations directly within the conceptual CLIP space. The core of our method lies in learning a direct mapping from the RGB domain to the depth domain strictly inside this embedding space. Our approach achieves state-of-the-art performance among CLIP embedding-based models on both indoor and outdoor datasets. The code used in this research is available at: https://github.com/ryutaroLF/PureCLIP-Depth
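To make the core idea concrete, here is a minimal, hypothetical sketch (not the authors' released code) of what "learning a direct RGB-to-depth mapping strictly inside the CLIP embedding space" could look like: a frozen image encoder produces an embedding, and only a small mapper operating in that embedding space is trained. The encoder below is a random stand-in for CLIP's image tower, and the depth-domain targets are placeholders; names like `EmbeddingMapper` are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 512  # CLIP ViT-B-style embedding width (assumption)

class EmbeddingMapper(nn.Module):
    """Hypothetical mapper from RGB-image embeddings to depth-domain
    embeddings, operating entirely inside the (CLIP-like) space."""
    def __init__(self, dim: int = EMBED_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim)
        )

    def forward(self, z_rgb: torch.Tensor) -> torch.Tensor:
        # Keep outputs on the unit sphere, as CLIP embeddings are.
        return F.normalize(self.net(z_rgb), dim=-1)

# Stand-in for a frozen CLIP image encoder (a real implementation would
# use e.g. open_clip's visual tower with requires_grad_(False)).
frozen_encoder = nn.Linear(3 * 32 * 32, EMBED_DIM).requires_grad_(False)

mapper = EmbeddingMapper()
opt = torch.optim.Adam(mapper.parameters(), lr=1e-3)

rgb = torch.rand(4, 3, 32, 32)  # toy RGB batch
with torch.no_grad():
    z_rgb = F.normalize(frozen_encoder(rgb.flatten(1)), dim=-1)
    # Placeholder depth-domain targets; the paper defines these inside
    # CLIP space, which this sketch does not reproduce.
    z_depth_target = F.normalize(torch.randn(4, EMBED_DIM), dim=-1)

z_pred = mapper(z_rgb)
# Align predicted and target embeddings by cosine similarity.
loss = (1 - F.cosine_similarity(z_pred, z_depth_target)).mean()
loss.backward()
opt.step()
print(z_pred.shape, loss.item())
```

Note that no text prompts and no depth decoder appear anywhere in this sketch: supervision and prediction both live in the embedding space, which is the property the paper's "prompt-free, decoder-free" claim refers to. For the actual architecture and loss, see the linked repository.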