Let Geometry GUIDE: Layer-wise Unrolling of Geometric Priors in Multimodal LLMs

arXiv cs.CV / 4/8/2026


Key Points

  • This paper proposes a method for more effectively integrating geometric priors into multimodal LLMs (MLLMs), which are strong at 2D tasks but exhibit weak physical spatial awareness when processing real-world visual streams.
  • It identifies the conventional paradigm of single deep-layer extraction plus input-level fusion as a bottleneck: this flattened fusion loses local geometric detail and causes semantic mismatches in the early layers.
  • The proposed method, GUIDE (Geometric Unrolling Inside MLLM Early-layers), performs multi-level sampling within the geometric encoder to obtain multi-granularity features ranging from local edges to global topology, then progressively aligns and fuses them with the early layers of the MLLM.
  • It further introduces context-aware gating, which fetches the spatial cues needed for the current context, balancing effective use of spatial priors against suppression of redundant geometric noise.
  • Experiments show that GUIDE significantly outperforms existing baselines on complex spatial reasoning and perception tasks, establishing a new paradigm for integrating 3D geometric priors into large models.
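The context-aware gating described above can be sketched minimally: a semantics-conditioned gate in [0, 1] scales how much of the aligned geometric prior is injected into each token's hidden state. This is an illustrative NumPy sketch under assumptions; the function and weight names (`gated_inject`, `W_gate`) are hypothetical, not from the paper.

```python
# Hedged sketch of context-aware gating (illustrative names, toy sizes):
# the gate is computed from the current hidden state (semantics) and
# decides, per channel, how much geometric prior to inject.
import numpy as np

rng = np.random.default_rng(0)
d = 8  # hidden width (illustrative)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_inject(hidden, geo_prior, W_gate, b_gate):
    """Residually fuse an aligned geometric prior into hidden states,
    scaled by a semantics-conditioned gate in (0, 1) per channel."""
    gate = sigmoid(hidden @ W_gate + b_gate)   # (tokens, d)
    return hidden + gate * geo_prior           # gated residual injection

hidden = rng.standard_normal((4, d))      # visual-token hidden states
geo = rng.standard_normal((4, d))         # aligned geometric prior
W = rng.standard_normal((d, d)) * 0.1
b = np.zeros(d)

fused = gated_inject(hidden, geo, W, b)
print(fused.shape)  # → (4, 8)
```

When the gate saturates near zero, the prior is suppressed (redundant geometric noise); near one, the spatial cue passes through, matching the "fetch requisite spatial cues based on current semantics" behavior the abstract describes.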

Abstract

Multimodal Large Language Models (MLLMs) have achieved remarkable progress in 2D visual tasks but still exhibit limited physical spatial awareness when processing real-world visual streams. Recently, feed-forward geometric foundation models, which implicitly extract geometric priors, have provided a new pathway to address this issue. However, existing geometry-aware MLLMs are predominantly constrained by the paradigm of single deep-layer extraction and input-level fusion. This flattened fusion leads to the loss of local geometric details and causes semantic mismatches in the early layers. To break this bottleneck, we propose GUIDE (Geometric Unrolling Inside MLLM Early-layers), a progressive geometric-prior injection framework. GUIDE performs multi-level sampling within the geometric encoder, comprehensively capturing multi-granularity features ranging from local edges to global topologies. Subsequently, we rigorously align and fuse these multi-level geometric priors step-by-step with the early layers of the MLLM. Building upon the injection of multi-granularity geometric information, this design guides the model to progressively learn the 2D-to-3D transitional process. Furthermore, we introduce a context-aware gating mechanism that enables the model to fetch requisite spatial cues based on current semantics, thereby maximizing the utilization efficiency of spatial priors and effectively suppressing redundant geometric noise. Extensive experiments demonstrate that GUIDE significantly outperforms existing baselines on multiple complex spatial reasoning and perception tasks, establishing a novel paradigm for integrating 3D geometric priors into large models.
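The "layer-wise unrolling" of the abstract — one granularity level of geometric features aligned and injected per early layer — can be sketched as a simple loop. This is a toy NumPy sketch under assumptions: `layer` stands in for an MLLM transformer block and `align` for the per-level alignment projection; neither name comes from the paper.

```python
# Hedged sketch of progressive, layer-wise injection of multi-level
# geometric priors into the early layers of an MLLM (toy stand-ins).
import numpy as np

rng = np.random.default_rng(1)
d, n_tok, n_levels = 8, 4, 3  # illustrative sizes

def layer(h):
    # stand-in for one MLLM transformer block
    return np.tanh(h)

def align(feat, W):
    # stand-in for the per-level alignment projection
    return feat @ W

# Multi-level geometric features, ordered from local (edges) to global
# (topology); in the paper these come from multi-level sampling inside
# the geometric encoder.
geo_levels = [rng.standard_normal((n_tok, d)) for _ in range(n_levels)]
W_align = [rng.standard_normal((d, d)) * 0.1 for _ in range(n_levels)]

h = rng.standard_normal((n_tok, d))  # visual-token hidden states
for g, W in zip(geo_levels, W_align):
    h = layer(h + align(g, W))  # inject one granularity level, then run the block
print(h.shape)  # → (4, 8)
```

Injecting local-edge features first and global topology later is what lets the model "progressively learn the 2D-to-3D transitional process" rather than receiving all geometry flattened at the input, as in the single deep-layer baseline the paper criticizes.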