Flash-Mono: Feed-Forward Accelerated Gaussian Splatting Monocular SLAM

arXiv cs.RO / 4/6/2026

Key Points

Flash-Mono addresses key weaknesses of monocular 3D Gaussian Splatting SLAM—slow train-from-scratch optimization, geometry inaccuracies, and inconsistent multi-view scale—by shifting to a feed-forward approach that predicts Gaussian attributes directly from multi-frame context.
The system uses a recurrent feed-forward frontend with cross-attention to build a hidden state, jointly predicting camera poses and per-pixel Gaussian properties, and a 2D Gaussian splatting mapping backend for efficient reconstruction.
To tackle drift and improve global consistency, Flash-Mono leverages hidden states as compact submap descriptors for efficient loop closure and performs global Sim(3) optimization.
For improved geometric fidelity, it replaces conventional 3D Gaussian ellipsoids with 2D Gaussian surfels, and reports state-of-the-art tracking and mapping performance.
The method claims a 10x speedup over optimization-based GS-SLAM while maintaining high-quality rendering, targeting real-time embodied perception and reconstruction use cases.
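The global Sim(3) optimization mentioned above operates on similarity transforms, which extend rigid poses with a scale factor so that submaps reconstructed at inconsistent monocular scales can be brought into one frame. The following is a minimal sketch of the Sim(3) group operations only, not Flash-Mono's optimizer. The tuple layout `(s, R, t)` and all function names are assumptions made for illustration.

```python
import math

# A Sim(3) element is stored here as (s, R, t): scale, 3x3 rotation, translation.
# This layout and every function name below are illustrative assumptions.

def mat_vec(R, v):
    return [sum(R[i][k] * v[k] for k in range(3)) for i in range(3)]

def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def transpose(R):
    return [[R[j][i] for j in range(3)] for i in range(3)]

def sim3_apply(T, x):
    """Map a point: x' = s * R x + t."""
    s, R, t = T
    Rx = mat_vec(R, x)
    return [s * Rx[i] + t[i] for i in range(3)]

def sim3_compose(A, B):
    """Return the transform that applies B first, then A."""
    sa, Ra, ta = A
    sb, Rb, tb = B
    Ra_tb = mat_vec(Ra, tb)
    return (sa * sb, mat_mul(Ra, Rb), [sa * Ra_tb[i] + ta[i] for i in range(3)])

def sim3_inverse(T):
    """Invert x' = s R x + t, giving x = (1/s) R^T (x' - t)."""
    s, R, t = T
    Rt = transpose(R)
    Rt_t = mat_vec(Rt, t)
    return (1.0 / s, Rt, [-Rt_t[i] / s for i in range(3)])

def rot_z(theta):
    """Rotation about the z axis by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
```

For example, `sim3_apply((2.0, rot_z(math.pi / 2), [1.0, 0.0, 0.0]), [1.0, 0.0, 0.0])` rotates the point onto the y axis, doubles its length, and shifts it by one unit along x. A pose-graph optimizer over Sim(3) would adjust one such transform per submap so that loop-closure constraints agree.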

Abstract

Monocular 3D Gaussian Splatting SLAM suffers from critical limitations in time efficiency, geometric accuracy, and multi-view consistency. These issues stem from the time-consuming Train-from-Scratch optimization and the lack of inter-frame scale consistency from single-frame geometry priors. We contend that a feed-forward paradigm, leveraging multi-frame context to predict Gaussian attributes directly, is crucial for addressing these challenges. We present Flash-Mono, a system composed of three core modules: a feed-forward prediction frontend, a 2D Gaussian Splatting mapping backend, and an efficient hidden-state-based loop closure module. We trained a recurrent feed-forward frontend model that progressively aggregates multi-frame visual features into a hidden state via cross attention and jointly predicts camera poses and per-pixel Gaussian properties. By directly predicting Gaussian attributes, our method bypasses the burdensome per-frame optimization required in optimization-based GS-SLAM, achieving a 10x speedup while ensuring high-quality rendering. The power of our recurrent architecture extends beyond efficient prediction. The hidden states act as compact submap descriptors, facilitating efficient loop closure and global Sim(3) optimization to mitigate the long-standing challenge of drift. For enhanced geometric fidelity, we replace conventional 3D Gaussian ellipsoids with 2D Gaussian surfels. Extensive experiments demonstrate that Flash-Mono achieves state-of-the-art performance in both tracking and mapping quality, highlighting its potential for embodied perception and real-time reconstruction applications. Project page: https://victkk.github.io/flash-mono.
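The abstract says that hidden states serve as compact submap descriptors for loop closure but does not describe the matching procedure. One common way to use such descriptors is nearest-neighbor retrieval by cosine similarity, sketched below. It assumes each submap's hidden state has already been reduced to a flat list of floats. The function names, the `min_gap` rule, and the threshold value are hypothetical, not taken from the paper.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def find_loop_candidates(descriptors, query_idx, min_gap=5, threshold=0.9):
    """Return (index, score) pairs for earlier submaps that resemble the query.

    Only submaps at least `min_gap` positions before the query are considered,
    so that temporally adjacent submaps are not reported as loops.
    Results are sorted by descending similarity.
    """
    query = descriptors[query_idx]
    hits = []
    for i in range(query_idx - min_gap + 1):  # empty range if too few submaps
        score = cosine_similarity(descriptors[i], query)
        if score >= threshold:
            hits.append((i, score))
    hits.sort(key=lambda pair: pair[1], reverse=True)
    return hits
```

Each returned pair would then be a candidate loop-closure edge to verify geometrically before adding it as a constraint in a global Sim(3) optimization.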
