HD-VGGT: High-Resolution Visual Geometry Transformer

arXiv cs.CV / 3/31/2026


Key Points

  • HD-VGGT is an architecture for 3D reconstruction from high-resolution images, designed to curb the explosion in computation and memory that prior VGGT-style feed-forward methods face as resolution and view count grow.
  • A dual-branch design: the low-resolution branch estimates coarse, globally consistent geometry, while the high-resolution branch refines fine details through a learned feature upsampling module.
  • To address the unstable-token problem caused by visually ambiguous regions (repetitive patterns, weak textures, specular reflections), which worsens as resolution increases, the authors propose Feature Modulation, which suppresses unreliable features early in the transformer.
  • The paper reports state-of-the-art reconstruction quality using high-resolution inputs and equivalent supervision, at a lower cost than a full-resolution transformer.
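The Feature Modulation idea described above can be illustrated with a minimal sketch: each token is scaled by a predicted reliability score so that unreliable tokens contribute less to subsequent attention layers. The linear confidence head (`w_gate`, `b_gate`) here is a hypothetical stand-in; the paper's actual gating mechanism may differ.

```python
import numpy as np

def feature_modulation(tokens, w_gate, b_gate):
    """Scale each token by a per-token reliability score in (0, 1).

    tokens : (N, D) array of token features
    w_gate : (D,) weights of a hypothetical linear confidence head
    b_gate : scalar bias of that head
    """
    logits = tokens @ w_gate + b_gate        # (N,) confidence logits
    conf = 1.0 / (1.0 + np.exp(-logits))     # sigmoid gate per token
    return tokens * conf[:, None], conf

rng = np.random.default_rng(0)
N, D = 6, 8
tokens = rng.standard_normal((N, D))
w, b = rng.standard_normal(D) * 0.1, 0.0
modulated, conf = feature_modulation(tokens, w, b)
# Tokens with low predicted confidence are attenuated before they can
# propagate unstable features through the rest of the transformer.
```

In practice the gate would be learned jointly with the backbone; the key design point is that suppression happens early, before ambiguous tokens contaminate global attention.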

Abstract

High-resolution imagery is essential for accurate 3D reconstruction, as many geometric details only emerge at fine spatial scales. Recent feed-forward approaches, such as the Visual Geometry Grounded Transformer (VGGT), have demonstrated the ability to infer scene geometry from large collections of images in a single forward pass. However, scaling these models to high-resolution inputs remains challenging: the number of tokens in transformer architectures grows rapidly with both image resolution and the number of views, leading to prohibitive computational and memory costs. Moreover, we observe that visually ambiguous regions, such as repetitive patterns, weak textures, or specular surfaces, often produce unstable feature tokens that degrade geometric inference, especially at higher resolutions. We introduce HD-VGGT, a dual-branch architecture for efficient and robust high-resolution 3D reconstruction. A low-resolution branch predicts a coarse, globally consistent geometry, while a high-resolution branch refines details via a learned feature upsampling module. To handle unstable tokens, we propose Feature Modulation, which suppresses unreliable features early in the transformer. HD-VGGT leverages high-resolution images and supervision without full-resolution transformer costs, achieving state-of-the-art reconstruction quality.
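The coarse-to-fine pipeline in the abstract can be sketched as follows: the low-resolution branch operates on a downsampled input to keep token counts manageable, and its output is upsampled back to full resolution. The fixed nearest-neighbour upsampling below is only a placeholder for HD-VGGT's learned feature upsampling module, and the 2D "depth map" stands in for the model's geometric output.

```python
import numpy as np

def downsample(img, factor):
    """Average-pool by `factor` (stand-in for the low-resolution branch input)."""
    H, W = img.shape
    return img.reshape(H // factor, factor, W // factor, factor).mean(axis=(1, 3))

def upsample_nearest(feat, factor):
    """Nearest-neighbour upsampling (placeholder for the learned module)."""
    return feat.repeat(factor, axis=0).repeat(factor, axis=1)

rng = np.random.default_rng(1)
hi_res = rng.standard_normal((16, 16))   # stand-in high-resolution input
coarse = downsample(hi_res, 4)           # low-res branch: 16x fewer tokens
refined = upsample_nearest(coarse, 4)    # back to full resolution
# In HD-VGGT the upsampling is learned and injects high-frequency detail
# from high-resolution features; here it is fixed, so `refined` stays coarse.
```

The cost argument follows directly: with self-attention quadratic in token count, processing a 4x-downsampled image cuts per-image attention cost by roughly 256x, which is what makes the dual-branch split attractive at high resolution.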