Flash-Mono: Feed-Forward Accelerated Gaussian Splatting Monocular SLAM

arXiv cs.RO / 4/6/2026

Key Points

  • Flash-Mono addresses three weaknesses of monocular 3D Gaussian Splatting SLAM (slow train-from-scratch optimization, inaccurate geometry, and inconsistent scale across views) by shifting to a feed-forward approach that predicts Gaussian attributes directly from multi-frame context.
  • The system uses a recurrent feed-forward frontend with cross-attention to build a hidden state, jointly predicting camera poses and per-pixel Gaussian properties, and a 2D Gaussian splatting mapping backend for efficient reconstruction.
  • To tackle drift and improve global consistency, Flash-Mono leverages hidden states as compact submap descriptors for efficient loop closure and performs global Sim(3) optimization.
  • For improved geometric fidelity, it replaces conventional 3D Gaussian ellipsoids with 2D Gaussian surfels, and reports state-of-the-art tracking and mapping performance.
  • The method claims a 10x speedup over optimization-based GS-SLAM while maintaining high-quality rendering, targeting real-time embodied perception and reconstruction use cases.
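
The idea of treating hidden states as compact submap descriptors for loop closure can be illustrated with a minimal sketch. It assumes each submap's hidden state has already been pooled into a fixed-length vector and that candidates are ranked by cosine similarity. The pooling step, the function name `find_loop_candidates`, and the `min_gap` and `threshold` parameters are hypothetical and not taken from the paper.

```python
import numpy as np

def find_loop_candidates(descriptors, query_idx, min_gap=5, threshold=0.9):
    """Return indices of earlier submaps whose descriptor resembles the query.

    descriptors: (N, D) array, one pooled hidden-state vector per submap
                 (assumed representation, not the paper's exact format).
    min_gap:     skip the most recent submaps, which overlap trivially.
    threshold:   minimum cosine similarity to count as a loop candidate.
    """
    d = np.asarray(descriptors, dtype=np.float64)
    # L2-normalise so that a dot product equals cosine similarity.
    norms = np.linalg.norm(d, axis=1, keepdims=True)
    d = d / np.clip(norms, 1e-12, None)
    sims = d @ d[query_idx]

    last = query_idx - min_gap  # newest submap still eligible
    if last < 0:
        return []
    candidates = [(i, sims[i]) for i in range(last + 1) if sims[i] >= threshold]
    # Most similar candidate first.
    candidates.sort(key=lambda pair: -pair[1])
    return [int(i) for i, _ in candidates]
```

Any candidate returned here would still need geometric verification (for example, a relative Sim(3) estimate between the two submaps) before it is added as a loop-closure constraint.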

Abstract

Monocular 3D Gaussian Splatting SLAM suffers from critical limitations in time efficiency, geometric accuracy, and multi-view consistency. These issues stem from the time-consuming train-from-scratch optimization and the lack of inter-frame scale consistency from single-frame geometry priors. We contend that a feed-forward paradigm, leveraging multi-frame context to predict Gaussian attributes directly, is crucial for addressing these challenges. We present Flash-Mono, a system composed of three core modules: a feed-forward prediction frontend, a 2D Gaussian Splatting mapping backend, and an efficient hidden-state-based loop closure module. We trained a recurrent feed-forward frontend model that progressively aggregates multi-frame visual features into a hidden state via cross attention and jointly predicts camera poses and per-pixel Gaussian properties. By directly predicting Gaussian attributes, our method bypasses the burdensome per-frame optimization required in optimization-based GS-SLAM, achieving a 10x speedup while ensuring high-quality rendering. The power of our recurrent architecture extends beyond efficient prediction. The hidden states act as compact submap descriptors, facilitating efficient loop closure and global Sim(3) optimization to mitigate the long-standing challenge of drift. For enhanced geometric fidelity, we replace conventional 3D Gaussian ellipsoids with 2D Gaussian surfels. Extensive experiments demonstrate that Flash-Mono achieves state-of-the-art performance in both tracking and mapping quality, highlighting its potential for embodied perception and real-time reconstruction applications. Project page: https://victkk.github.io/flash-mono.
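
Because monocular reconstruction leaves scale ambiguous, submaps are related by a similarity transform Sim(3): a scale s, a rotation R, and a translation t. The sketch below shows the standard closed-form Umeyama alignment between two sets of corresponding 3D points, which is one common way to obtain a single pairwise Sim(3) estimate. It is not the paper's global optimization, and the function names `umeyama_sim3` and `apply_sim3` are illustrative.

```python
import numpy as np

def umeyama_sim3(src, dst):
    """Least-squares s, R, t such that dst is close to s * R @ src + t.

    src, dst: (N, 3) arrays of corresponding points, N >= 3, not all collinear.
    """
    src = np.asarray(src, dtype=np.float64)
    dst = np.asarray(dst, dtype=np.float64)
    n = len(src)

    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d

    # Cross-covariance between the centred point sets.
    cov = xd.T @ xs / n
    U, D, Vt = np.linalg.svd(cov)

    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0

    R = U @ S @ Vt
    var_s = (xs ** 2).sum() / n          # variance of the source points
    s = np.trace(np.diag(D) @ S) / var_s
    t = mu_d - s * R @ mu_s
    return s, R, t

def apply_sim3(s, R, t, points):
    """Map (N, 3) points through the similarity transform."""
    return s * np.asarray(points, dtype=np.float64) @ R.T + t
```

In a pose-graph setting, pairwise estimates of this kind would typically serve as edge constraints, with a global optimizer then adjusting all submap transforms jointly to distribute accumulated drift.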