ProDiG: Progressive Diffusion-Guided Gaussian Splatting for Aerial to Ground Reconstruction

arXiv cs.CV / 4/3/2026


Key Points

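  • ProDiG tackles the problem of generating ground-level views and a consistent 3D site model from aerial imagery alone, proposing a progressive reconstruction method that resists geometric breakdown even across wide viewpoint gaps.
  • Rather than relying on post-hoc rendering refinement or multi-altitude ground truth, ProDiG synthesizes intermediate-altitude views and refines the Gaussian representation stage by stage with a diffusion model.
  • A geometry-aware causal attention module injects geometric (epipolar) structure into reference-view diffusion inference, and a distance-adaptive module dynamically adjusts Gaussian scale and opacity according to camera distance; together they give stable reconstruction across large changes in distance.
  • In experiments on synthetic and real-world data, the authors report that ProDiG substantially outperforms existing methods in visual realism, 3D geometric consistency, and robustness to extreme viewpoint changes.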
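
To make the distance-adaptive idea concrete, here is a minimal Python sketch. The summary only says that Gaussian scale and opacity are adjusted according to camera distance; it does not give the rule. Everything below is therefore an assumption for illustration: the `Gaussian` class, the `adapt_gaussian` function, the `ref_distance` parameter, the clamping bounds, and the square-root opacity rule are all hypothetical and are not taken from the paper.

```python
import math
from dataclasses import dataclass


@dataclass
class Gaussian:
    position: tuple   # (x, y, z) in world units
    scale: float      # isotropic extent in world units
    opacity: float    # in [0, 1]


def adapt_gaussian(g, camera_pos, ref_distance, min_ratio=0.25, max_ratio=4.0):
    """Return a copy of g rescaled for the current camera distance.

    Hypothetical rule (not from the paper): the scale is multiplied by the
    clamped ratio d / ref_distance, and the opacity by the square root of
    that ratio, so Gaussians fitted from far away become smaller and more
    transparent when the camera moves close.
    """
    if ref_distance <= 0:
        raise ValueError("ref_distance must be positive")
    d = math.dist(g.position, camera_pos)
    ratio = min(max(d / ref_distance, min_ratio), max_ratio)
    new_scale = g.scale * ratio
    new_opacity = min(1.0, g.opacity * math.sqrt(ratio))
    return Gaussian(g.position, new_scale, new_opacity)


if __name__ == "__main__":
    g = Gaussian(position=(0.0, 0.0, 0.0), scale=2.0, opacity=0.8)
    for altitude in (100.0, 50.0, 10.0):
        print(altitude, adapt_gaussian(g, (0.0, 0.0, altitude), ref_distance=100.0))
```

The clamp keeps the adjustment bounded when the camera is very close to or very far from a Gaussian, which is one simple way to avoid degenerate sizes; the actual module may use a different, possibly learned, mapping.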

Abstract

Generating ground-level views and coherent 3D site models from aerial-only imagery is challenging due to extreme viewpoint changes, missing intermediate observations, and large scale variations. Existing methods either refine renderings post-hoc, often producing geometrically inconsistent results, or rely on multi-altitude ground-truth, which is rarely available. Gaussian Splatting and diffusion-based refinements improve fidelity under small variations but fail under wide aerial-to-ground gaps. To address these limitations, we introduce ProDiG (Progressive Altitude Gaussian Splatting), a diffusion-guided framework that progressively transforms aerial 3D representations toward ground-level fidelity. ProDiG synthesizes intermediate-altitude views and refines the Gaussian representation at each stage using a geometry-aware causal attention module that injects epipolar structure into reference-view diffusion. A distance-adaptive Gaussian module dynamically adjusts Gaussian scale and opacity based on camera distance, ensuring stable reconstruction across large viewpoint gaps. Together, these components enable progressive, geometrically grounded refinement without requiring additional ground-truth viewpoints. Extensive experiments on synthetic and real-world datasets demonstrate that ProDiG produces visually realistic ground-level renderings and coherent 3D geometry, significantly outperforming existing approaches in terms of visual quality, geometric consistency, and robustness to extreme viewpoint changes.
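
The abstract does not spell out how epipolar structure is injected into attention. One common way to express such a constraint, sketched below under stated assumptions, is to penalize attention between a reference-view pixel and a target-view pixel by the target pixel's distance from the epipolar line. The point-to-line distance is standard two-view geometry, using the convention x_tgt^T F x_ref = 0. The function names, the Gaussian-shaped bias, and the `sigma` parameter are hypothetical, and the causal part of the module is not modeled here.

```python
import math


def epipolar_line(F, x_ref):
    """Epipolar line l = F @ [x, y, 1] in the target image, as (a, b, c)."""
    xh = (x_ref[0], x_ref[1], 1.0)
    return tuple(sum(F[r][c] * xh[c] for c in range(3)) for r in range(3))


def epipolar_distance(F, x_ref, x_tgt):
    """Distance in pixels from x_tgt to the epipolar line of x_ref."""
    a, b, c = epipolar_line(F, x_ref)
    norm = math.hypot(a, b)
    if norm == 0.0:
        raise ValueError("degenerate epipolar line")
    return abs(a * x_tgt[0] + b * x_tgt[1] + c) / norm


def attention_bias(F, x_ref, x_tgt, sigma=2.0):
    """Hypothetical additive attention bias: 0 on the line, negative off it."""
    d = epipolar_distance(F, x_ref, x_tgt)
    return -(d * d) / (2.0 * sigma * sigma)


if __name__ == "__main__":
    # Fundamental matrix of a pure horizontal translation:
    # epipolar lines are image rows (y_tgt == y_ref).
    F = [[0.0, 0.0, 0.0],
         [0.0, 0.0, -1.0],
         [0.0, 1.0, 0.0]]
    print(attention_bias(F, (10.0, 20.0), (50.0, 20.0)))
    print(attention_bias(F, (10.0, 20.0), (50.0, 23.0)))
```

Adding such a bias to the attention logits before the softmax would concentrate attention near geometrically plausible correspondences; whether ProDiG uses a soft bias, a hard mask, or another mechanism is not stated in the abstract.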