FUS3DMaps: Scalable and Accurate Open-Vocabulary Semantic Mapping by 3D Fusion of Voxel- and Instance-Level Layers

arXiv cs.RO / 5/6/2026

Key Points

  • FUS3DMaps is a new research method for open-vocabulary semantic mapping that lets robots spatially localize previously unseen concepts without predefined class sets.
  • Instead of using only instance-level or only dense patch-level fusion, it maintains both an instance-level layer and a dense layer within a shared voxel map and fuses them via cross-layer interaction.
  • The approach improves the semantic quality of both layers while enabling scalable, accurate instance-level mapping by limiting dense processing and cross-layer fusion to a sliding spatial window.
  • Experiments on established 3D semantic segmentation benchmarks and large-scale multi-story scenes show that FUS3DMaps achieves strong open-vocabulary performance at building scales.
  • The authors plan to release additional materials and code via a project website.
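As a rough illustration of what "spatially localizing" an unseen concept can mean in practice, the sketch below ranks voxels by cosine similarity between a stored embedding and a query embedding. This is not the authors' code: the function names, the 0.8 threshold, and the three-dimensional toy vectors are all assumptions for illustration, and in a real system the embeddings would come from a vision-language encoder rather than being hand-written.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def query_voxels(voxel_embeddings, query_embedding, threshold=0.8):
    """Return voxel keys whose embedding scores at least `threshold`, best match first."""
    scored = [(cosine(emb, query_embedding), key)
              for key, emb in voxel_embeddings.items()]
    hits = [(score, key) for score, key in scored if score >= threshold]
    # Sort by descending score, breaking ties by voxel key for determinism.
    hits.sort(key=lambda pair: (-pair[0], pair[1]))
    return [key for _, key in hits]

# Hypothetical toy map: integer voxel index -> stored semantic embedding.
toy_map = {
    (0, 0, 0): [1.0, 0.0, 0.0],
    (1, 0, 0): [0.9, 0.1, 0.0],
    (5, 2, 1): [0.0, 1.0, 0.0],
}
toy_query = [1.0, 0.0, 0.0]  # stands in for an encoded text prompt
matches = query_voxels(toy_map, toy_query)
```

Because the query is an arbitrary embedding rather than a class index, no predefined class set is needed; the map only has to store embeddings in the same space as the query.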

Abstract

Open-vocabulary semantic mapping enables robots to spatially ground previously unseen concepts without requiring predefined class sets. Current training-free methods commonly rely on multi-view fusion of semantic embeddings into a 3D map, either at the instance level, by segmenting views and encoding image crops of the segments, or by projecting image patch embeddings directly into a dense semantic map. The latter approach sidesteps segmentation and 2D-to-3D instance association by operating on full uncropped image frames, but existing methods remain limited in scalability. We present FUS3DMaps, an online dual-layer semantic mapping method that jointly maintains both dense and instance-level open-vocabulary layers within a shared voxel map. This design enables further voxel-level semantic fusion of the layer embeddings, combining the complementary strengths of both semantic mapping approaches. We find that our proposed semantic cross-layer fusion approach improves the quality of both the instance-level and dense layers, while also enabling a scalable and highly accurate instance-level map in which the dense layer and cross-layer fusion are restricted to a spatial sliding window. Experiments on established 3D semantic segmentation benchmarks as well as a selection of large-scale scenes show that FUS3DMaps achieves accurate open-vocabulary semantic mapping at multi-story building scales. Additional material and code will be made available at https://githanonymous.github.io/FUS3DMaps/.
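The dual-layer idea described in the abstract can be sketched as a small data structure. The abstract does not state the fusion rule or how the window is defined, so everything specific here is an assumption: the `Voxel` and `DualLayerMap` names, the cube-shaped window in voxel-index units, the weighted average with weight `alpha`, and the choice to discard dense embeddings outside the window. It is meant only to show how a shared voxel map can hold both layers while dense data stays local.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Key = Tuple[int, int, int]

@dataclass
class Voxel:
    instance_id: Optional[int] = None      # link into the instance-level layer
    dense: Optional[List[float]] = None    # dense-layer embedding (window only)
    fused: Optional[List[float]] = None    # result of cross-layer fusion

@dataclass
class DualLayerMap:
    voxels: Dict[Key, Voxel] = field(default_factory=dict)
    instances: Dict[int, List[float]] = field(default_factory=dict)

    @staticmethod
    def in_window(key: Key, center: Key, half_extent: int) -> bool:
        """True if the voxel lies inside a cube of side 2*half_extent+1 around center."""
        return all(abs(k - c) <= half_extent for k, c in zip(key, center))

    def fuse_window(self, center: Key, half_extent: int, alpha: float = 0.5) -> None:
        """Fuse both layers inside the window; drop dense embeddings outside it."""
        for key, voxel in self.voxels.items():
            if not self.in_window(key, center, half_extent):
                voxel.dense = None  # assumption: dense data is not kept globally
                continue
            instance_emb = self.instances.get(voxel.instance_id)
            if voxel.dense is not None and instance_emb is not None:
                # Assumed fusion rule: per-dimension weighted average.
                voxel.fused = [alpha * d + (1.0 - alpha) * i
                               for d, i in zip(voxel.dense, instance_emb)]

# Toy usage with hypothetical two-dimensional embeddings.
toy = DualLayerMap()
toy.instances[7] = [1.0, 0.0]
toy.voxels[(0, 0, 0)] = Voxel(instance_id=7, dense=[0.0, 1.0])
toy.voxels[(9, 0, 0)] = Voxel(instance_id=7, dense=[0.0, 1.0])
toy.fuse_window(center=(0, 0, 0), half_extent=2)
```

In this sketch the per-voxel cost outside the window is a single instance id, which is one way to read the abstract's claim that the instance-level map scales to building-sized scenes while dense processing stays bounded.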