VGGT-360: Geometry-Consistent Zero-Shot Panoramic Depth Estimation

arXiv cs.CV / March 20, 2026


Key Points

  • VGGT-360 is a training-free, zero-shot framework for panoramic depth estimation that reformulates the task as panorama-to-3D-to-depth using multi-view reconstructed 3D models and VGGT-style foundation models.
  • It introduces three plug-and-play modules: (i) uncertainty-guided adaptive projection, which slices panoramas into perspective views and allocates denser views to geometry-poor regions; (ii) structure-saliency enhanced attention, which improves 3D-reconstruction robustness and cross-view coherence; and (iii) correlation-weighted 3D model correction, which reweights overlapping points using attention-derived correlation scores to enforce consistent geometry.
  • The approach unifies fragmented per-view reasoning into a coherent panoramic understanding by leveraging intrinsic 3D consistency and bridging domain gaps between panoramic inputs and perspective priors.
  • Extensive experiments show VGGT-360 outperforms both trained and training-free state-of-the-art methods across multiple resolutions and diverse indoor and outdoor datasets.
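The first module's view-allocation idea can be illustrated with a small sketch. The paper only states that gradient-based uncertainty steers denser perspective views toward geometry-poor regions; the function below is our hypothetical stand-in (band splitting, the inverse-gradient uncertainty heuristic, and all names are assumptions, not the authors' implementation):

```python
import numpy as np

def allocate_views(pano_gray, n_bands=8, total_views=16, min_views=1):
    """Sketch of uncertainty-guided view allocation over an equirectangular
    panorama: split it into longitude bands, score each band by a
    gradient-based uncertainty proxy, and give more perspective views to
    geometry-poor (low-gradient) bands. Purely illustrative."""
    h, w = pano_gray.shape
    gy, gx = np.gradient(pano_gray.astype(np.float64))
    grad = np.hypot(gx, gy)  # per-pixel gradient magnitude

    band_w = w // n_bands
    band_grad = np.array(
        [grad[:, i * band_w:(i + 1) * band_w].mean() for i in range(n_bands)]
    )
    # Heuristic: weak gradients -> little geometric evidence -> high uncertainty.
    uncertainty = 1.0 / (band_grad + 1e-6)
    weights = uncertainty / uncertainty.sum()

    # Allocate views proportionally, guaranteeing each band at least min_views.
    return np.maximum(min_views, np.round(weights * total_views).astype(int))
```

For example, a panorama whose right half is featureless (e.g. blank sky or wall) would receive more perspective views on that side under this heuristic, mirroring the paper's goal of feeding VGGT geometry-informative inputs.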

Abstract

This paper presents VGGT-360, a novel training-free framework for zero-shot, geometry-consistent panoramic depth estimation. Unlike prior view-independent training-free approaches, VGGT-360 reformulates the task as panoramic reprojection over multi-view reconstructed 3D models by leveraging the intrinsic 3D consistency of VGGT-like foundation models, thereby unifying fragmented per-view reasoning into a coherent panoramic understanding. To achieve robust and accurate estimation, VGGT-360 integrates three plug-and-play modules that form a unified panorama-to-3D-to-depth framework: (i) Uncertainty-guided adaptive projection slices panoramas into perspective views to bridge the domain gap between panoramic inputs and VGGT's perspective prior. It estimates gradient-based uncertainty to allocate denser views to geometry-poor regions, yielding geometry-informative inputs for VGGT. (ii) Structure-saliency enhanced attention strengthens VGGT's robustness during 3D reconstruction by injecting structure-aware confidence into its attention layers, guiding focus toward geometrically reliable regions and enhancing cross-view coherence. (iii) Correlation-weighted 3D model correction refines the reconstructed 3D model by reweighting overlapping points using attention-inferred correlation scores, providing a consistent geometric basis for accurate panoramic reprojection. Extensive experiments show that VGGT-360 outperforms both trained and training-free state-of-the-art methods across multiple resolutions and diverse indoor and outdoor datasets.
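Module (iii) can likewise be sketched minimally. The abstract says overlapping points are reweighted by attention-inferred correlation scores; a natural reading is a correlation-weighted average of the candidate 3D positions that different views predict for the same point. The snippet below shows that idea under our own assumptions (the softmax weighting and all names are ours, not the paper's):

```python
import numpy as np

def fuse_overlapping_points(points, corr_scores):
    """Fuse V per-view estimates of one 3D point via correlation-weighted
    averaging. points: (V, 3) candidate positions; corr_scores: (V,)
    attention-derived correlation scores (higher = more trusted).
    Illustrative sketch only."""
    scores = np.asarray(corr_scores, dtype=np.float64)
    w = np.exp(scores - scores.max())  # softmax, shifted for stability
    w /= w.sum()
    return (w[:, None] * np.asarray(points, dtype=np.float64)).sum(axis=0)
```

Points with high correlation dominate the fused position, so an outlier estimate from a poorly matched view contributes almost nothing, which is the consistency effect the correction module targets.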