3D-IDE: 3D-Implicit Depth Emergence

arXiv cs.CV / 4/7/2026


Key Points

  • The proposed method, "3D-Implicit Depth Emergence (3D-IDE)", reframes the trade-off that typically arises in 2D-3D fusion: instead of encoding 3D explicitly, it treats 3D perception as an emergent property of geometric self-supervision.
  • Concretely, auxiliary objectives such as a fine-grained geometry validator and global representation constraints construct an information bottleneck that maximizes the mutual information between visual features and 3D structures, letting 3D awareness emerge naturally.
  • It removes the depth and pose dependencies of prior methods at inference time and does not "graft" external 3D foundation models, targeting zero latency overhead.
  • Experiments report that the method surpasses SOTA on multiple 3D scene understanding benchmarks, cutting inference latency by 55% while maintaining strong performance on downstream tasks.
  • The code is publicly available on GitHub to support adoption and reproducibility (github.com/ChushanZhang/3D-IDE).
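The paper does not spell out its training objective, but a common way to maximize mutual information between two feature sets, as the second key point describes, is an InfoNCE-style contrastive bound. The sketch below is purely illustrative (the function name, NumPy implementation, and temperature value are assumptions, not the authors' code): matched visual/geometry feature pairs sit on the diagonal of a similarity matrix, and minimizing the loss tightens a lower bound on their mutual information.

```python
import numpy as np

def infonce_mi_lower_bound(vis_feats, geo_feats, temperature=0.07):
    """Illustrative InfoNCE-style lower bound on the mutual information
    between 2D visual features and 3D geometric features.
    vis_feats, geo_feats: (N, D) arrays; row i of each is a matched pair.
    """
    # L2-normalise both feature sets so similarities are cosines
    v = vis_feats / np.linalg.norm(vis_feats, axis=1, keepdims=True)
    g = geo_feats / np.linalg.norm(geo_feats, axis=1, keepdims=True)
    logits = v @ g.T / temperature                # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Matched pairs lie on the diagonal; minimising this loss maximises
    # the InfoNCE bound on I(visual; geometry).
    return -np.mean(np.diag(log_probs))
```

Under this view, the "information bottleneck" arises because the visual encoder must produce features from which geometry is linearly recoverable, without ever receiving depth or pose as input.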

Abstract

Leveraging 3D information within Multimodal Large Language Models (MLLMs) has recently shown significant advantages for indoor scene understanding. However, existing methods, including those using explicit ground-truth 3D positional encoding and those grafting external 3D foundation models for implicit geometry, struggle with the trade-off in 2D-3D representation fusion, leading to suboptimal deployment. To this end, we propose 3D-Implicit Depth Emergence, a method that reframes 3D perception as an emergent property derived from geometric self-supervision rather than explicit encoding. Our core insight is the Implicit Geometric Emergence Principle: by strategically leveraging privileged geometric supervision through mechanisms like a fine-grained geometry validator and global representation constraints, we construct an information bottleneck. This bottleneck forces the model to maximize the mutual information between visual features and 3D structures, allowing 3D awareness to emerge naturally within a unified visual representation. Unlike existing approaches, our method enables 3D perception to emerge implicitly, disentangling features in dense regions and, crucially, eliminating depth and pose dependencies during inference with zero latency overhead. This paradigm shift from external grafting to implicit emergence represents a fundamental rethinking of 3D knowledge integration in visual-language models. Extensive experiments demonstrate that our method surpasses SOTA on multiple 3D scene understanding benchmarks. Our approach achieves a 55% reduction in inference latency while maintaining strong performance across diverse downstream tasks, underscoring the effectiveness of meticulously designed auxiliary objectives for dependency-free 3D understanding. Source code can be found at github.com/ChushanZhang/3D-IDE.
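The abstract's "zero latency overhead" claim follows a familiar pattern: auxiliary heads (here, the geometry validator) participate only in the training loss and are dropped at inference, so the deployed forward path is the plain visual encoder. The class below is a minimal sketch of that pattern under assumed shapes and names (`EmergentDepthEncoder`, `w_geo`, etc. are hypothetical, not from the released code):

```python
import numpy as np

class EmergentDepthEncoder:
    """Sketch of training-only auxiliary supervision: the geometry head
    feeds the self-supervised loss during training and is skipped at
    inference, so inference needs no depth/pose and no extra compute."""

    def __init__(self, dim=16, seed=0):
        rng = np.random.default_rng(seed)
        self.w_vis = rng.normal(size=(dim, dim)) / np.sqrt(dim)  # shared visual encoder
        self.w_geo = rng.normal(size=(dim, dim)) / np.sqrt(dim)  # auxiliary geometry head

    def forward(self, x, training=False):
        feats = np.tanh(x @ self.w_vis)  # unified visual representation
        if training:
            # Auxiliary geometry prediction, used only to compute the
            # privileged geometric-supervision loss; never run at inference.
            geo_pred = feats @ self.w_geo
            return feats, geo_pred
        return feats  # inference path: identical features, zero overhead
```

Because the inference path is byte-for-byte the training visual path minus the auxiliary branch, the 3D awareness the paper reports must live inside `feats` itself rather than in any grafted 3D module.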