VGGT-SLAM++

arXiv cs.CV / 4/9/2026


Key Points

  • The paper introduces VGGT-SLAM++, a complete visual SLAM system that uses geometry-rich outputs from the Visual Geometry Grounded Transformer (VGGT) to improve odometry and mapping performance.
  • Its pipeline combines a transformer-based visual odometry front-end fused with a Sim(3) solution, a DEM-based graph construction module, and a back-end designed to restore high-cadence local bundle adjustment (LBA) for better trajectory stability.
  • VGGT-SLAM++ builds a dense planar-canonical digital elevation map (DEM) for each VGGT submap, partitions it into patches, and computes DINOv2 embeddings of those patches to integrate the submap into a covisibility graph.
  • A visual place recognition (VPR) module retrieves spatial neighbors within a covisibility window, which triggers frequent local optimization; this substantially reduces short-horizon pose drift and improves graph convergence while keeping memory usage bounded.
  • Experiments on standard SLAM benchmarks report state-of-the-art accuracy and faster graph convergence, with global consistency maintained using compact DEM tiles and sublinear retrieval.
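
To make the DEM step in the key points more concrete, here is a minimal sketch of rasterizing a submap's points into an elevation grid and cutting that grid into patches. It is an illustration under stated assumptions, not the paper's implementation: the function names `build_dem` and `split_into_patches` are hypothetical, the points are assumed to be already expressed in a canonical frame whose z axis is normal to the dominant plane, and keeping the maximum height per cell is one common rasterization choice that the summary does not confirm.

```python
from math import floor, nan


def build_dem(points, cell_size, width, height):
    """Rasterize (x, y, z) points into a height x width elevation grid.

    Assumption: points are already in a plane-aligned canonical frame.
    Each cell keeps the maximum z that falls into it; empty cells stay NaN.
    """
    dem = [[nan] * width for _ in range(height)]
    for x, y, z in points:
        col = floor(x / cell_size)
        row = floor(y / cell_size)
        if 0 <= row < height and 0 <= col < width:
            current = dem[row][col]
            # `current != current` is True only for NaN (an empty cell).
            if current != current or z > current:
                dem[row][col] = z
    return dem


def split_into_patches(dem, patch):
    """Cut the grid into non-overlapping patch x patch tiles keyed by tile index."""
    height, width = len(dem), len(dem[0])
    tiles = {}
    for r0 in range(0, height - patch + 1, patch):
        for c0 in range(0, width - patch + 1, patch):
            tile = [row[c0:c0 + patch] for row in dem[r0:r0 + patch]]
            tiles[(r0 // patch, c0 // patch)] = tile
    return tiles


# Toy submap: four points on a 4 m x 4 m area with 1 m cells.
points = [(0.1, 0.1, 1.0), (0.2, 0.3, 2.5), (1.6, 0.4, 0.7), (3.9, 3.9, 4.0)]
dem = build_dem(points, cell_size=1.0, width=4, height=4)
tiles = split_into_patches(dem, patch=2)
```

In a pipeline like the one described, each tile would then be passed to an image encoder (DINOv2 in the paper) to obtain a descriptor; that step is omitted here because it depends on a pretrained model.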

Abstract

We introduce VGGT-SLAM++, a complete visual SLAM system that leverages the geometry-rich outputs of the Visual Geometry Grounded Transformer (VGGT). The system comprises a visual odometry front-end that fuses the VGGT feed-forward transformer with a Sim(3) solution, a Digital Elevation Map (DEM)-based graph construction module, and a back-end; together, these enable accurate large-scale mapping with bounded memory. While prior transformer-based SLAM pipelines such as VGGT-SLAM rely primarily on sparse loop closures or global Sim(3) manifold constraints, which allows short-horizon pose drift, VGGT-SLAM++ restores high-cadence local bundle adjustment (LBA) through a spatially corrective back-end. For each VGGT submap, we construct a dense planar-canonical DEM, partition it into patches, and compute their DINOv2 embeddings to integrate the submap into a covisibility graph. Spatial neighbors are retrieved using a Visual Place Recognition (VPR) module within the covisibility window, triggering frequent local optimization that stabilizes trajectories. Across standard SLAM benchmarks, VGGT-SLAM++ achieves state-of-the-art accuracy, substantially reducing short-term drift, accelerating graph convergence, and maintaining global consistency with compact DEM tiles and sublinear retrieval.
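
The neighbor retrieval described in the abstract can be pictured with a small sketch: given one descriptor per submap, compare a new submap against the submaps in its covisibility window and keep the most similar ones as candidate graph edges. Everything named here is an assumption for illustration: `retrieve_neighbors`, `top_k`, and `min_sim` are hypothetical, the three-dimensional vectors stand in for real DINOv2 embeddings, and the linear scan is a simplification of the sublinear retrieval the paper reports.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_neighbors(query_id, embeddings, window, top_k=2, min_sim=0.5):
    """Return up to top_k (submap_id, similarity) pairs from `window`.

    Only submaps listed in the covisibility window are compared, and
    matches below `min_sim` are discarded. Results are sorted by
    descending similarity.
    """
    query = embeddings[query_id]
    scored = []
    for submap_id in window:
        if submap_id == query_id:
            continue
        sim = cosine(query, embeddings[submap_id])
        if sim >= min_sim:
            scored.append((submap_id, sim))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy descriptors: one vector per submap (placeholders for DINOv2 embeddings).
embeddings = {
    0: [1.0, 0.0, 0.0],
    1: [0.9, 0.1, 0.0],
    2: [0.0, 1.0, 0.0],
    3: [0.8, 0.0, 0.6],
}

# Submap 3 is new; submaps 0 to 2 are inside its covisibility window.
neighbors = retrieve_neighbors(3, embeddings, window=[0, 1, 2])
# Each retrieved neighbor becomes a candidate edge for local optimization.
edges = [(3, submap_id) for submap_id, _ in neighbors]
```

Restricting the comparison to a window is what keeps the per-submap work bounded in this sketch; a real system would more likely replace the loop with an approximate nearest-neighbor index.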