SING3R-SLAM: Submap-based Indoor Monocular Gaussian SLAM with 3D Reconstruction Priors

arXiv cs.RO / 4/6/2026

Key Points

SING3R-SLAM is introduced as a globally consistent, Gaussian-based monocular indoor SLAM method aimed at improving incremental global 3D reconstruction.
The framework uses a persistent Global Gaussian Map as a differentiable memory to reduce issues like accumulated drift and scale inconsistency common in prior approaches.
It performs local geometry reconstruction with submap-level global alignment, then further refines local geometry by leveraging consistency from the global map.
Experiments on real-world datasets show state-of-the-art results, including a pose accuracy improvement of more than 10% and finer, more detailed 3D geometry.
The method is reported to maintain a compact, memory-efficient global representation while enabling efficient 3D mapping for multiple downstream applications such as pose estimation and novel view rendering.

Abstract

Recent advances in dense 3D reconstruction have demonstrated strong capability in accurately capturing local geometry. However, extending these methods to incremental global reconstruction, as required in SLAM systems, remains challenging. Without explicit modeling of global geometric consistency, existing approaches often suffer from accumulated drift, scale inconsistency, and suboptimal local geometry. To address these issues, we propose SING3R-SLAM, a globally consistent Gaussian-based monocular indoor SLAM framework. Our approach represents the scene with a Global Gaussian Map that serves as a persistent, differentiable memory, incorporates local geometric reconstruction via submap-level global alignment, and leverages the global map's consistency to further refine local geometry. This design enables efficient and versatile 3D mapping for multiple downstream applications. Extensive experiments show that SING3R-SLAM achieves state-of-the-art performance in pose estimation, 3D reconstruction, and novel view rendering. It improves pose accuracy by over 10%, produces finer and more detailed geometry, and maintains a compact and memory-efficient global representation on real-world datasets.
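
To make the "scale inconsistency" problem concrete, here is a minimal sketch of one ingredient of a generic similarity alignment between a locally reconstructed submap and a global map. The abstract does not say how SING3R-SLAM computes its submap-level alignment, so this is not the authors' method. It assumes that matched 3D points are available in both frames and that they are related by a similarity transform (scale, rotation, translation). The function names (`centroid`, `spread`, `estimate_relative_scale`) are hypothetical. Because rotation and translation do not change distances to the centroid, the relative scale can be estimated from the ratio of point spreads alone.

```python
import math


def centroid(points):
    """Mean of a list of (x, y, z) tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def spread(points):
    """Sum of squared distances of the points from their centroid."""
    c = centroid(points)
    return sum(sum((p[i] - c[i]) ** 2 for i in range(3)) for p in points)


def estimate_relative_scale(submap_pts, global_pts):
    """Estimate s such that global ~= s * R * submap + t for matched points.

    Assumption (not from the paper): the two point lists are matched
    one-to-one and differ only by a similarity transform. Rotation R and
    translation t leave the spread unchanged, so s is the square root of
    the ratio of the two spreads.
    """
    if len(submap_pts) != len(global_pts) or len(submap_pts) < 2:
        raise ValueError("need at least two matched points per frame")
    src = spread(submap_pts)
    if src == 0.0:
        raise ValueError("submap points are degenerate (all identical)")
    return math.sqrt(spread(global_pts) / src)


# Hypothetical example: a submap reconstructed at an arbitrary scale.
submap = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0)]

# The same points in the global frame: rotated 90 degrees about z,
# scaled by 2.5, and shifted by (1, -2, 0.5).
in_global = [(2.5 * -y + 1.0, 2.5 * x - 2.0, 2.5 * z + 0.5) for x, y, z in submap]

scale = estimate_relative_scale(submap, in_global)
print(f"estimated relative scale: {scale:.3f}")
```

A full alignment would also need the rotation and translation (for example via a closed-form similarity fit), and a real system would have to cope with outliers and noisy correspondences. The sketch only shows why a monocular submap needs a per-submap scale before it can be fused into a shared map.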