Fus3D: Decoding Consolidated 3D Geometry from Feed-forward Geometry Transformer Latents

arXiv cs.CV / 3/30/2026



Key Points

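Fus3D proposes a feed-forward method that regresses a dense Signed Distance Field (SDF) from unstructured image collections in under three seconds, without camera calibration or post-hoc fusion.
Observing that existing pipelines route intermediate transformer features through per-view prediction heads and then discard them, it extracts 3D geometry directly from multi-view geometry transformer features via learned volumetric extraction (voxelized canonical embeddings).
Interleaved cross- and self-attention lets these embeddings absorb multi-view geometry information into a structured volumetric latent grid, which a simple convolutional decoder maps to an SDF.
It introduces a scalable, validity-aware supervision scheme that uses SDFs derived from depth maps or 3D assets, addressing practical issues such as non-watertight meshes.
It yields complete, well-defined distance values in both sparse- and dense-view settings and demonstrates geometrically plausible completions.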
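
The extraction step in the points above can be pictured with a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the names `attention` and `extract_volume`, the grid size, the token counts, and the random stand-ins for learned embeddings and transformer features are all hypothetical, the attention has a single head with no learned projections, and the "simple convolutional decoder" is reduced to a 1x1x1 convolution (a per-voxel linear map).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys_values):
    # Single-head scaled dot-product attention without learned projections
    # (a simplification; a real model would project Q, K and V separately).
    dim = queries.shape[-1]
    weights = softmax(queries @ keys_values.T / np.sqrt(dim))
    return weights @ keys_values

def extract_volume(voxel_queries, view_tokens, num_blocks=2):
    # Interleave cross-attention (voxels read multi-view tokens) with
    # self-attention (voxels exchange information), using residual updates.
    x = voxel_queries
    for _ in range(num_blocks):
        x = x + attention(x, view_tokens)  # cross-attention
        x = x + attention(x, x)            # self-attention
    return x

rng = np.random.default_rng(0)
grid, dim = 4, 16

# Hypothetical stand-ins: one embedding per voxel, and 3 views x 10 tokens
# of intermediate geometry-transformer features.
voxel_queries = rng.normal(size=(grid ** 3, dim))
view_tokens = rng.normal(size=(3 * 10, dim))

latent_grid = extract_volume(voxel_queries, view_tokens)
latent_grid = latent_grid.reshape(grid, grid, grid, dim)

# Stand-in decoder: a 1x1x1 convolution is a linear map applied per voxel.
decoder_weights = rng.normal(size=(dim,))
sdf = latent_grid @ decoder_weights  # one signed distance value per voxel
```

The point of the sketch is the data flow: the voxel queries are the only 3D structure, and image information enters solely through cross-attention, so no camera poses or per-view depth heads appear anywhere.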

Abstract

We propose a feed-forward method for dense Signed Distance Field (SDF) regression from unstructured image collections in less than three seconds, without camera calibration or post-hoc fusion. Our key insight is that the intermediate feature space of pretrained multi-view feed-forward geometry transformers already encodes a powerful joint world representation; yet, existing pipelines discard it, routing features through per-view prediction heads before assembling 3D geometry post-hoc, which discards valuable completeness information and accumulates inaccuracies. We instead perform 3D extraction directly from geometry transformer features via learned volumetric extraction: voxelized canonical embeddings that progressively absorb multi-view geometry information through interleaved cross- and self-attention into a structured volumetric latent grid. A simple convolutional decoder then maps this grid to a dense SDF. We additionally propose a scalable, validity-aware supervision scheme directly using SDFs derived from depth maps or 3D assets, tackling practical issues like non-watertight meshes. Our approach yields complete and well-defined distance values across sparse- and dense-view settings and demonstrates geometrically plausible completions. Code and further material can be found at https://lorafib.github.io/fus3d.
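
To make "validity-aware supervision from depth maps" concrete, the following pure-Python sketch shows one common way such targets can be built. It is an assumption for illustration, not the scheme from the paper: it uses a TSDF-style projective distance (surface depth minus point depth along the optical axis, which only approximates the true Euclidean distance), a pinhole camera with hypothetical intrinsics `fx, fy, cx, cy`, and a hypothetical truncation `trunc`. Points that cannot be judged from the depth map are flagged invalid and excluded from the loss.

```python
def projective_sdf(point_cam, depth, fx, fy, cx, cy, trunc):
    """Return (sdf, valid) for one 3D point given in camera coordinates.

    depth is a 2D list of metric depths; 0.0 marks a missing measurement.
    """
    x, y, z = point_cam
    if z <= 0.0:
        return 0.0, False  # behind the camera
    u = int(round(fx * x / z + cx))  # pixel column
    v = int(round(fy * y / z + cy))  # pixel row
    if not (0 <= v < len(depth) and 0 <= u < len(depth[0])):
        return 0.0, False  # projects outside the image
    d = depth[v][u]
    if d <= 0.0:
        return 0.0, False  # hole in the depth map
    signed = d - z  # positive in front of the surface, negative behind it
    if signed < -trunc:
        return 0.0, False  # far behind the visible surface: unobserved
    return min(signed, trunc), True

def masked_l1(pred, target, valid):
    # Average absolute error over valid samples only.
    terms = [abs(p - t) for p, t, m in zip(pred, target, valid) if m]
    return sum(terms) / len(terms) if terms else 0.0

# Toy 3x3 depth map: a plane at 2.0 m with one missing pixel.
depth = [[0.0, 2.0, 2.0],
         [2.0, 2.0, 2.0],
         [2.0, 2.0, 2.0]]
points = [(0.0, 0.0, 1.5),    # in front of the surface
          (0.0, 0.0, 2.1),    # just behind the surface
          (0.0, 0.0, 3.0),    # deep behind the surface (occluded)
          (-2.0, -2.0, 2.0)]  # projects onto the missing pixel
samples = [projective_sdf(p, depth, 1.0, 1.0, 1.0, 1.0, trunc=0.25)
           for p in points]
targets = [s for s, _ in samples]
valid = [m for _, m in samples]

# Predictions for invalid points do not affect the loss.
loss = masked_l1([0.25, 0.0, 9.0, 9.0], targets, valid)
```

Masking rather than guessing a value for unobserved points is what lets incomplete sources such as depth maps or non-watertight meshes be used at all: the network is simply not penalized where the target is unknown.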

