Monocular Depth Estimation From the Perspective of Feature Restoration: A Diffusion Enhanced Depth Restoration Approach

arXiv cs.CV / 4/10/2026


Key Points

  • This paper reviews the limitations of the mainstream encoder-decoder architecture for monocular depth estimation (MDE) and argues that there is still substantial room for improvement if encoder features can be enhanced.
  • It reformulates depth estimation as feature restoration: pretrained encoder features are treated as degraded versions of an assumed ground-truth feature that would yield the ground-truth depth map, and restoration is performed with InvT-IndDiffusion (Invertible Transform-enhanced Indirect Diffusion).
  • Because no direct supervision is available on the features, only indirect supervision from the final sparse depth map is used; the feature deviations that arise between diffusion steps are suppressed by an invertible transform-based decoder satisfying a bi-Lipschitz condition.
  • A plug-and-play AV-LFE (Auxiliary Viewpoint-based Low-level Feature Enhancement) module is also introduced to sharpen local details using auxiliary viewpoint information when available, and the method outperforms state-of-the-art results on multiple datasets.
  • On the KITTI benchmark, RMSE improves over the baseline by 4.09% and 37.77% depending on the training setting, and the code is publicly available on GitHub.
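The "indirect supervision" idea in the third bullet can be illustrated with a toy example (our own sketch, not the paper's code): a feature vector `z` never receives a direct target; the loss is computed only on a few observed pixels of the depth map produced by a fixed decoder, and gradients flow back through the decoder to the features. The linear `decode` function, the sparse `mask`, and all numbers below are hypothetical.

```python
# Toy illustration of indirect supervision on features (hypothetical setup):
# the loss touches only sparse depth observations, never the features directly.

def decode(z, w):
    """Fixed linear 'decoder': depth[i] = w * z[i]."""
    return [w * zi for zi in z]

def sparse_loss_grad(z, w, gt_depth, mask):
    """Gradient w.r.t. z of 0.5 * sum over observed pixels of (decode(z) - gt)^2."""
    d = decode(z, w)
    return [w * (d[i] - gt_depth[i]) if mask[i] else 0.0 for i in range(len(z))]

w = 2.0
gt_depth = [4.0, 6.0, 8.0, 10.0]   # ground-truth depth (mostly unobserved)
mask = [True, False, True, False]  # sparse supervision: only 2 of 4 pixels
z = [0.0, 0.0, 0.0, 0.0]           # "degraded" features to be restored

for _ in range(200):               # plain gradient descent on the features
    g = sparse_loss_grad(z, w, gt_depth, mask)
    z = [zi - 0.1 * gi for zi, gi in zip(z, g)]
```

After optimization, the features at observed pixels decode to the correct depths, while unobserved pixels stay at their degraded values; this is exactly why purely indirect supervision leaves slack that must be controlled by other means (here, the paper's invertible decoder).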

Abstract

Monocular Depth Estimation (MDE) is a fundamental computer vision task with important applications in 3D vision. The current mainstream MDE methods employ an encoder-decoder architecture with multi-level/scale feature processing. However, the limitations of this architecture and the effects of features at different levels on prediction accuracy have not been evaluated. In this paper, we first investigate the above problem and show that there is still substantial potential in the current framework if encoder features can be improved. We therefore propose to formulate the depth estimation problem from the feature restoration perspective, treating pretrained encoder features as degraded versions of an assumed ground-truth feature that yields the ground-truth depth map. An Invertible Transform-enhanced Indirect Diffusion (InvT-IndDiffusion) module is then developed for feature restoration. Due to the absence of direct supervision on the features, only indirect supervision from the final sparse depth map is used. During the iterative diffusion procedure, this results in feature deviations across steps. The proposed InvT-IndDiffusion solves this problem by using an invertible transform-based decoder under the bi-Lipschitz condition. Finally, a plug-and-play Auxiliary Viewpoint-based Low-level Feature Enhancement (AV-LFE) module is developed to enhance local details with auxiliary viewpoints when they are available. Experiments demonstrate that the proposed method achieves better performance than state-of-the-art methods on various datasets. Specifically, on the KITTI benchmark, performance improves over the baseline by 4.09% and 37.77% in terms of RMSE under different training settings. Code is available at https://github.com/whitehb1/IID-RDepth.
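The bi-Lipschitz condition that makes the decoder invertible can be sketched with a minimal toy (our own example, not the paper's InvT-IndDiffusion): a residual map g(x) = x + f(x) is invertible, with bounds (1 - L)|x - y| ≤ |g(x) - g(y)| ≤ (1 + L)|x - y|, whenever f is a contraction with Lipschitz constant L < 1, and its inverse can be computed by fixed-point iteration. The choice f(x) = 0.5·tanh(x) below is an assumption for illustration.

```python
import math

def f(x):
    return 0.5 * math.tanh(x)   # Lipschitz constant L = 0.5 < 1 (contraction)

def g(x):
    return x + f(x)             # bi-Lipschitz residual transform: bounded by 0.5 and 1.5

def g_inverse(y, n_iter=50):
    """Invert g by Banach fixed-point iteration: x <- y - f(x)."""
    x = y
    for _ in range(n_iter):
        x = y - f(x)
    return x

x0 = 1.234
y0 = g(x0)
x_rec = g_inverse(y0)           # recovers x0 up to fixed-point tolerance
```

The lower Lipschitz bound is what prevents distinct features from collapsing to the same output, which is the property the paper exploits to keep features consistent across diffusion steps.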