LESV: Language Embedded Sparse Voxel Fusion for Open-Vocabulary 3D Scene Understanding

arXiv cs.CV / 4/3/2026


Key Points

  • The paper introduces LESV, a new framework for open-vocabulary 3D scene understanding that addresses limitations of 3D Gaussian Splatting-based methods, including spatial ambiguity and semantic bleeding from overlapping/unstructured Gaussians and mask pooling.
  • LESV replaces unstructured Gaussian representations with Sparse Voxel Rasterization (SVRaster) as a structured, disjoint geometry representation, regularized using monocular depth and surface normal priors to stabilize geometry.
  • The stabilized geometry enables deterministic, confidence-aware feature registration, which the authors say suppresses the semantic bleeding artifacts common in 3DGS pipelines.
  • To reduce multi-level semantic ambiguity, the method leverages dense alignment properties from the foundation model AM-RADIO rather than using computationally expensive hierarchical training.
  • The authors report state-of-the-art results on open-vocabulary 3D object retrieval and point cloud understanding benchmarks, with particular gains on fine-grained queries where prior registration methods struggle.
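To make the idea of deterministic, confidence-aware registration concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes each pixel ray has already been assigned to exactly one voxel of a disjoint grid (the hypothetical `voxel_ids` array) and carries a scalar confidence. The function name `register_features` and all parameters are illustrative.

```python
import numpy as np

def register_features(voxel_ids, pixel_feats, confidences, num_voxels, min_conf=0.0):
    """Confidence-weighted average of 2D features per voxel (illustrative only).

    voxel_ids:   (N,) int, the single voxel assigned to each pixel, -1 if none
    pixel_feats: (N, D) per-pixel feature vectors
    confidences: (N,) non-negative weight per pixel
    """
    voxel_ids = np.asarray(voxel_ids)
    pixel_feats = np.asarray(pixel_feats, dtype=np.float64)
    confidences = np.asarray(confidences, dtype=np.float64)

    # Drop pixels that hit nothing or fall below the confidence threshold.
    keep = (voxel_ids >= 0) & (confidences > min_conf)
    ids, feats, w = voxel_ids[keep], pixel_feats[keep], confidences[keep]

    feat_sum = np.zeros((num_voxels, pixel_feats.shape[1]))
    weight_sum = np.zeros(num_voxels)

    # Each pixel contributes to exactly one voxel, so no feature is
    # spread across overlapping primitives.
    np.add.at(feat_sum, ids, feats * w[:, None])
    np.add.at(weight_sum, ids, w)

    # Normalize only the voxels that received at least one observation.
    observed = weight_sum > 0
    feat_sum[observed] = feat_sum[observed] / weight_sum[observed][:, None]
    return feat_sum, weight_sum
```

Because the assignment is one pixel to one voxel, running this twice on the same inputs gives the same result, and a voxel never receives features from a ray that was assigned to its neighbor. How the paper actually computes the assignment and the confidence is not described in this summary.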

Abstract

Recent advances in open-vocabulary 3D scene understanding rely heavily on 3D Gaussian Splatting (3DGS) to register vision-language features into 3D space. However, we identify two critical limitations in these approaches: the spatial ambiguity arising from unstructured, overlapping Gaussians, which necessitates probabilistic feature registration, and the multi-level semantic ambiguity caused by pooling features over object-level masks, which dilutes fine-grained details. To address these challenges, we present a novel framework that leverages Sparse Voxel Rasterization (SVRaster) as a structured, disjoint geometry representation. By regularizing SVRaster with monocular depth and normal priors, we establish a stable geometric foundation. This enables a deterministic, confidence-aware feature registration process and suppresses the semantic bleeding artifact common in 3DGS. Furthermore, we resolve multi-level ambiguity by exploiting the emerging dense alignment properties of the foundation model AM-RADIO, avoiding the computational overhead of hierarchical training methods. Our approach achieves state-of-the-art performance on Open Vocabulary 3D Object Retrieval and Point Cloud Understanding benchmarks, particularly excelling on fine-grained queries where registration methods typically fail.
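
As a rough illustration of what open-vocabulary retrieval over registered features looks like, the sketch below ranks voxels by cosine similarity to a text query embedding. It assumes the per-voxel features and the text embedding already live in a shared embedding space; how that alignment is obtained is not covered here. The function `retrieve_voxels` and its parameters are hypothetical and not taken from the paper.

```python
import numpy as np

def retrieve_voxels(voxel_feats, text_embedding, top_k=5):
    """Return indices and scores of the top_k voxels most similar to a query.

    voxel_feats:    (V, D) one feature vector per voxel
    text_embedding: (D,) query embedding in the same space (assumed)
    """
    voxel_feats = np.asarray(voxel_feats, dtype=np.float64)
    text_embedding = np.asarray(text_embedding, dtype=np.float64)

    # L2-normalize both sides so the dot product equals cosine similarity.
    v = voxel_feats / (np.linalg.norm(voxel_feats, axis=1, keepdims=True) + 1e-12)
    t = text_embedding / (np.linalg.norm(text_embedding) + 1e-12)

    scores = v @ t
    order = np.argsort(-scores)[:top_k]  # highest similarity first
    return order, scores[order]
```

With dense per-voxel features, a fine-grained query can in principle match a small part of an object directly, whereas features pooled over an object-level mask would give every voxel in that object the same score.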