GGD-SLAM: Monocular 3DGS SLAM Powered by Generalizable Motion Model for Dynamic Environments

arXiv cs.RO / 4/15/2026


Key Points

  • The paper introduces GGD-SLAM, a monocular SLAM framework that leverages 3D Gaussian Splatting to produce high-fidelity dense maps while overcoming the common failure of SLAM in dynamic environments.
  • GGD-SLAM uses a generalizable motion model to improve both localization (camera pose estimation) and dense reconstruction without relying on predefined semantic annotations or external depth input.
  • The system incorporates a FIFO queue plus sequential attention for dynamic semantic feature extraction, together with a dynamic feature enhancer to disentangle static and dynamic components.
  • It reduces the harmful influence of dynamic distractors by filling occluded regions with sampled static information and by introducing a distractor-adaptive Structural Similarity Index Measure (SSIM) loss designed specifically for dynamic scenes.
  • Experiments on real-world dynamic datasets report state-of-the-art performance for pose estimation and dense reconstruction in dynamic settings.
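The FIFO-queue frame management with sequential attention described above can be sketched roughly as follows. This is a hypothetical illustration only: the paper's actual feature extractor, attention architecture, and queue length are not specified in this summary, so the per-frame feature vectors and the scaled dot-product pooling here are stand-ins.

```python
from collections import deque

import numpy as np


class FrameQueue:
    """Illustrative FIFO frame buffer: per-frame feature vectors are
    pooled with a simple scaled dot-product attention, so the newest
    frame can aggregate temporal context from its predecessors."""

    def __init__(self, maxlen=5, feat_dim=8):
        self.frames = deque(maxlen=maxlen)  # oldest frame evicted first
        self.feat_dim = feat_dim

    def push(self, feat):
        self.frames.append(np.asarray(feat, dtype=np.float64))

    def attend(self, query):
        # Attend the query (e.g. the newest frame's feature) over all
        # buffered frames, yielding one temporally fused feature vector.
        keys = np.stack(self.frames)                    # (T, D)
        scores = keys @ query / np.sqrt(self.feat_dim)  # (T,)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                        # softmax
        return weights @ keys                           # (D,)


q = FrameQueue(maxlen=3, feat_dim=4)
for t in range(5):                     # push 5 frames; only the last 3 survive
    q.push(np.full(4, float(t)))
fused = q.attend(query=np.full(4, 4.0))  # fused feature leans toward frame 4
```

Because the softmax sharply favors the frame most similar to the query, the fused vector stays close to the newest frame while still admitting contributions from older buffered frames.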

Abstract

Visual SLAM algorithms have achieved significant improvements by exploiting 3D Gaussian Splatting (3DGS) representations, particularly in generating high-fidelity dense maps. However, they depend on a static-environment assumption and degrade significantly in dynamic environments. This paper presents GGD-SLAM, a framework that employs a generalizable motion model to address localization and dense mapping in dynamic environments, without predefined semantic annotations or depth input. Specifically, the proposed system uses a First-In-First-Out (FIFO) queue to manage incoming frames, enabling dynamic semantic feature extraction through a sequential attention mechanism. This is integrated with a dynamic feature enhancer to separate static and dynamic components. Additionally, to minimize the impact of dynamic distractors on the static components, we devise a method to fill occluded areas via static information sampling and design a distractor-adaptive Structural Similarity Index Measure (SSIM) loss tailored to dynamic environments, significantly enhancing the system's resilience. Experiments on real-world dynamic datasets demonstrate that the proposed system achieves state-of-the-art performance in camera pose estimation and dense reconstruction in dynamic scenes.
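The paper's exact loss formulation is not given in this summary, but one plausible reading of a "distractor-adaptive" SSIM loss is a per-region re-weighting that down-weights areas flagged as dynamic. The sketch below computes SSIM over non-overlapping patches and weights each patch by its fraction of static pixels; the function name, patch size, and weighting scheme are all illustrative assumptions, not the paper's method.

```python
import numpy as np


def distractor_adaptive_ssim_loss(x, y, static_mask, p=8,
                                  C1=0.01**2, C2=0.03**2):
    """Hypothetical sketch: SSIM over non-overlapping p x p patches,
    with each patch weighted by its static-pixel fraction so that
    dynamic distractors contribute less to the photometric loss.
    Images are expected in [0, 1]."""
    H, W = (x.shape[0] // p) * p, (x.shape[1] // p) * p

    def patches(img):
        # (H, W) -> (H//p, W//p, p*p): one row of pixels per patch.
        return (img[:H, :W].reshape(H // p, p, W // p, p)
                           .transpose(0, 2, 1, 3)
                           .reshape(H // p, W // p, p * p))

    xp, yp, mp = patches(x), patches(y), patches(static_mask.astype(float))
    mx, my = xp.mean(-1), yp.mean(-1)
    vx, vy = xp.var(-1), yp.var(-1)
    cov = ((xp - mx[..., None]) * (yp - my[..., None])).mean(-1)
    ssim = ((2 * mx * my + C1) * (2 * cov + C2)) / \
           ((mx**2 + my**2 + C1) * (vx + vy + C2))
    w = mp.mean(-1)                       # static fraction per patch
    return 1.0 - float((w * ssim).sum() / (w.sum() + 1e-8))


rng = np.random.default_rng(0)
img = rng.random((32, 32))
corrupted = img.copy()
corrupted[:8, :8] = 0.0                   # simulated dynamic distractor
mask = np.ones((32, 32), bool)
mask[:8, :8] = False                      # flagged dynamic, so down-weighted
loss_masked = distractor_adaptive_ssim_loss(img, corrupted, mask)
loss_plain = distractor_adaptive_ssim_loss(img, corrupted,
                                           np.ones((32, 32), bool))
```

With the dynamic region masked out, the corrupted patch carries zero weight and the loss collapses to near zero, whereas the unweighted loss is penalized by the distractor. This captures the intuition of making the photometric term robust to moving objects, though the real system presumably predicts the mask from its motion model rather than receiving it as input.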