LESV: Language Embedded Sparse Voxel Fusion for Open-Vocabulary 3D Scene Understanding

arXiv cs.CV / 4/3/2026

Key Points

The paper introduces LESV, a framework for open-vocabulary 3D scene understanding that targets two limitations of 3D Gaussian Splatting (3DGS)-based methods: spatial ambiguity from unstructured, overlapping Gaussians, and multi-level semantic ambiguity from pooling features over object-level masks.
LESV replaces unstructured Gaussian representations with Sparse Voxel Rasterization (SVRaster) as a structured, disjoint geometry representation, regularized using monocular depth and surface normal priors to stabilize geometry.
The stabilized geometry enables deterministic, confidence-aware feature registration, which the authors say suppresses the semantic bleeding artifacts common in 3DGS pipelines.
To reduce multi-level semantic ambiguity, the method leverages dense alignment properties from the foundation model AM-RADIO rather than using computationally expensive hierarchical training.
The authors report state-of-the-art results on open-vocabulary 3D object retrieval and point cloud understanding benchmarks, with particular gains on fine-grained queries where prior registration methods struggle.
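To make the idea of deterministic, confidence-aware registration more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes each pixel has already been assigned to exactly one voxel (which a disjoint voxel grid permits) and that a non-negative confidence per observation is given. The function name `fuse_voxel_features`, the input tuple format, and the weighted-average rule are all hypothetical choices for illustration.

```python
import math
from collections import defaultdict


def fuse_voxel_features(observations, dim):
    """Confidence-weighted average of per-pixel features into disjoint voxels.

    observations: iterable of (voxel_id, feature, confidence) tuples.
    This input format is an assumption for this sketch, not the paper's API.
    Each pixel maps to exactly one voxel, so no probabilistic splitting of
    a feature across overlapping primitives is needed.
    """
    sums = defaultdict(lambda: [0.0] * dim)
    weights = defaultdict(float)

    for voxel_id, feature, confidence in observations:
        if confidence <= 0.0:
            continue  # skip observations that carry no confidence
        acc = sums[voxel_id]
        for i in range(dim):
            acc[i] += confidence * feature[i]
        weights[voxel_id] += confidence

    fused = {}
    for voxel_id, acc in sums.items():
        mean = [v / weights[voxel_id] for v in acc]
        norm = math.sqrt(sum(v * v for v in mean))
        # L2-normalize so features can later be compared by cosine similarity
        fused[voxel_id] = [v / norm for v in mean] if norm > 0 else mean
    return fused


# Hypothetical observations: two views see voxel 0, one view sees voxel 1,
# and a zero-confidence observation of voxel 2 is ignored.
observations = [
    (0, [1.0, 0.0], 3.0),
    (0, [0.0, 1.0], 1.0),
    (1, [0.0, 2.0], 0.5),
    (2, [1.0, 1.0], 0.0),
]
fused = fuse_voxel_features(observations, dim=2)
```

Because every observation lands in a single voxel, the fused feature of voxel 0 depends only on pixels that actually hit voxel 0. That is the intuition behind reduced semantic bleeding; how the paper computes confidences is not stated in this summary.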

Abstract

Recent advancements in open-vocabulary 3D scene understanding heavily rely on 3D Gaussian Splatting (3DGS) to register vision-language features into 3D space. However, we identify two critical limitations in these approaches: the spatial ambiguity arising from unstructured, overlapping Gaussians which necessitates probabilistic feature registration, and the multi-level semantic ambiguity caused by pooling features over object-level masks, which dilutes fine-grained details. To address these challenges, we present a novel framework that leverages Sparse Voxel Rasterization (SVRaster) as a structured, disjoint geometry representation. By regularizing SVRaster with monocular depth and normal priors, we establish a stable geometric foundation. This enables a deterministic, confidence-aware feature registration process and suppresses the semantic bleeding artifact common in 3DGS. Furthermore, we resolve multi-level ambiguity by exploiting the emerging dense alignment properties of foundation model AM-RADIO, avoiding the computational overhead of hierarchical training methods. Our approach achieves state-of-the-art performance on Open Vocabulary 3D Object Retrieval and Point Cloud Understanding benchmarks, particularly excelling on fine-grained queries where registration methods typically fail.
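Once language-aligned features are registered to voxels, an open-vocabulary query can in principle be answered by comparing a text embedding against each voxel feature. The sketch below assumes cosine similarity with a fixed threshold, a common choice in open-vocabulary pipelines that the abstract does not confirm. The names `cosine` and `retrieve_voxels`, the threshold value, and the toy vectors are hypothetical. In practice the text embedding would come from a text encoder aligned with the image features.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_voxels(voxel_features, text_embedding, threshold=0.5):
    """Return voxel ids scoring at or above threshold, best match first.

    voxel_features: dict mapping voxel_id -> feature vector (assumed format).
    threshold: assumed cut-off; a real system would tune or replace it.
    """
    scores = {
        voxel_id: cosine(feature, text_embedding)
        for voxel_id, feature in voxel_features.items()
    }
    hits = [voxel_id for voxel_id, score in scores.items() if score >= threshold]
    return sorted(hits, key=lambda voxel_id: -scores[voxel_id])


# Toy example with placeholder 2-D embeddings.
voxel_features = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
query = [1.0, 0.0]
matches = retrieve_voxels(voxel_features, query)
```

With these placeholder vectors, voxel 0 matches the query exactly, voxel 2 matches partially, and voxel 1 is orthogonal and falls below the threshold.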
