Unlocking the Power of Critical Factors for 3D Visual Geometry Estimation

arXiv cs.CV / 4/24/2026


Key Points

  • The study addresses a key weakness in multi-frame feed-forward visual geometry estimation: while it improves cross-frame consistency, it can still lag strong per-frame methods on single-frame accuracy.
  • Through extensive ablation experiments, the authors find that increasing data diversity and quality boosts performance, while widely used confidence-aware losses and certain gradient-based loss mechanisms can unintentionally reduce accuracy.
  • Training with joint supervision using both per-sequence and per-frame alignment improves results, whereas local region alignment unexpectedly harms performance.
  • The paper proposes two technical improvements—a consistency loss that aligns depth maps, camera parameters, and point maps, and an architecture that effectively leverages high-resolution inputs—and integrates them into CARVE.
  • Experiments across point cloud reconstruction, video depth estimation, and camera pose/intrinsic estimation show that CARVE delivers strong, robust performance across multiple benchmarks.

Abstract

Feed-forward visual geometry estimation has recently made rapid progress. However, an important gap remains: multi-frame models usually produce better cross-frame consistency, yet they often underperform strong per-frame methods on single-frame accuracy. This observation motivates our systematic investigation into the critical factors driving model performance through rigorous ablation studies, which reveals several key insights: 1) Scaling up data diversity and quality unlocks further performance gains even in state-of-the-art visual geometry estimation methods; 2) Commonly adopted confidence-aware loss and gradient-based loss mechanisms may unintentionally hinder performance; 3) Joint supervision through both per-sequence and per-frame alignment improves results, while local region alignment surprisingly degrades performance. Furthermore, we introduce two enhancements to integrate the advantages of optimization-based methods and high-resolution inputs: a consistency loss function that enforces alignment between depth maps, camera parameters, and point maps, and an efficient architectural design that leverages high-resolution information. We integrate these designs into CARVE, a resolution-enhanced model for feed-forward visual geometry estimation. Experiments on point cloud reconstruction, video depth estimation, and camera pose/intrinsic estimation show that CARVE achieves strong and robust performance across diverse benchmarks.
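The consistency loss described above ties together the three quantities a geometry model predicts: a point map should equal the depth map unprojected through the camera parameters. The paper does not give an implementation, so the sketch below is a minimal NumPy illustration of that idea under standard pinhole-camera assumptions; the function names and the plain L2 penalty are ours, not the authors'.

```python
import numpy as np

def unproject_depth(depth, K, cam_to_world):
    """Lift a depth map to a world-space point map via pinhole geometry.

    depth: (H, W) per-pixel depth; K: (3, 3) intrinsics;
    cam_to_world: (4, 4) camera-to-world pose.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)  # (H, W, 3)
    rays = pix @ np.linalg.inv(K).T          # camera-space ray directions
    pts_cam = rays * depth[..., None]        # scale rays by depth
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    return pts_cam @ R.T + t                 # transform into world space

def consistency_loss(point_map, depth, K, cam_to_world):
    """Mean L2 distance between the predicted point map and the point map
    implied by the predicted depth and camera parameters."""
    implied = unproject_depth(depth, K, cam_to_world)
    return np.mean(np.linalg.norm(point_map - implied, axis=-1))
```

A perfectly consistent prediction yields zero loss; any disagreement among the three outputs is penalized, which is the alignment the paper's loss is meant to enforce (CARVE's actual formulation may differ, e.g. in robust norms or weighting).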