A Comparison of Multi-View Stereo Methods for Photogrammetric 3D Reconstruction: From Traditional to Learning-Based Approaches

arXiv cs.CV / 4/14/2026

Key Points

  • The study compares traditional multi-view stereo (MVS) using COLMAP against multiple learning-based MVS approaches, spanning geometry-guided and end-to-end architectures.
  • Experiments on aerial scenarios (MARS-LVIG with LiDAR-derived ground truth, and a Pix4D scene with Pix4Dmapper-generated ground truth) evaluate accuracy, coverage, and runtime across methods.
  • Results indicate COLMAP can produce geometrically consistent reconstructions but typically takes more computation time than learning-based alternatives.
  • When traditional image registration fails, learning-based methods show stronger feature matching and improved robustness.
  • Geometry-guided learning methods often require careful dataset preparation and may depend on camera pose or depth priors from COLMAP, while end-to-end methods (e.g., DUSt3R, VGGT) are faster but can have larger 3D residuals in difficult cases.

Abstract

Photogrammetric 3D reconstruction has long relied on traditional Structure-from-Motion (SfM) and Multi-View Stereo (MVS) methods, which provide high accuracy but face challenges in speed and scalability. Recently, learning-based MVS methods have emerged, aiming for faster and more efficient reconstruction. This work presents a comparative evaluation between a representative traditional MVS pipeline (COLMAP) and state-of-the-art learning-based approaches, including geometry-guided methods (MVSNet, PatchmatchNet, MVSAnywhere, MVSFormer++) and end-to-end frameworks (Stereo4D, FoundationStereo, DUSt3R, MASt3R, Fast3R, VGGT). Two experiments were conducted on different aerial scenarios. The first experiment used the MARS-LVIG dataset, where the ground-truth 3D reconstruction was provided by LiDAR point clouds. The second experiment used a public scene from the Pix4D official website, with ground truth generated by Pix4Dmapper. We evaluated accuracy, coverage, and runtime across all methods. Experimental results show that although COLMAP provides reliable and geometrically consistent reconstructions, it requires more computation time than the learning-based methods. In cases where traditional methods fail at image registration, learning-based approaches exhibit stronger feature-matching capability and greater robustness. Geometry-guided methods usually require careful dataset preparation and often depend on camera poses or depth priors generated by COLMAP. End-to-end methods such as DUSt3R and VGGT achieve competitive accuracy and reasonable coverage while offering substantially faster reconstruction. However, they exhibit relatively large residuals in 3D reconstruction, particularly in challenging scenarios.
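
The abstract names accuracy and coverage as evaluation criteria but does not define them. The sketch below illustrates one common way such point-cloud metrics are computed against a reference cloud; it is an assumption for illustration, not the paper's stated protocol. Here accuracy is taken as the mean distance from each reconstructed point to its nearest ground-truth point, and coverage as the fraction of ground-truth points that have a reconstructed point within a threshold `tau`. The function names, the threshold, and the toy coordinates are all hypothetical.

```python
import math

def nearest_distance(point, cloud):
    """Brute-force Euclidean distance from `point` to its nearest neighbour in `cloud`."""
    return min(math.dist(point, other) for other in cloud)

def accuracy(reconstruction, ground_truth):
    """Mean reconstruction-to-ground-truth distance (lower is better).

    Assumed definition; the source does not specify its accuracy metric.
    """
    total = sum(nearest_distance(p, ground_truth) for p in reconstruction)
    return total / len(reconstruction)

def coverage(reconstruction, ground_truth, tau):
    """Fraction of ground-truth points with a reconstructed point within `tau`.

    Assumed definition; `tau` is a hypothetical distance threshold.
    """
    hits = sum(1 for g in ground_truth if nearest_distance(g, reconstruction) <= tau)
    return hits / len(ground_truth)

# Toy clouds (made-up coordinates, in arbitrary units) standing in for a
# LiDAR reference and an MVS reconstruction that misses part of the scene.
ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
reconstruction = [(0.0, 0.1, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.3)]

print("accuracy:", accuracy(reconstruction, ground_truth))
print("coverage:", coverage(reconstruction, ground_truth, tau=0.2))
```

The brute-force nearest-neighbour search keeps the example dependency-free but scales quadratically; for clouds of realistic size a spatial index such as a k-d tree would be the usual choice. Both clouds are also assumed to be already aligned in a common coordinate frame, which matters in particular for end-to-end methods whose output scale and pose may differ from the reference.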