OMCL: Open-vocabulary Monte Carlo Localization

arXiv cs.RO / 4/3/2026

Key Points

  • The paper presents OMCL (Open-vocabulary Monte Carlo Localization), an extension of Monte Carlo Localization that uses vision-language features to compute observation likelihoods given a camera pose and a 3D map (see the sketch after this list).
  • It targets cases where robot measurements and the map come from different sensor modalities, addressing limitations of prior closed-set, environment-specific localization methods.
  • OMCL supports cross-modality association between visual observations and map elements, enabling global localization initialization directly from natural-language descriptions of nearby objects.
  • Experiments on Matterport3D and Replica demonstrate indoor robustness, and results on SemanticKITTI show outdoor generalization.

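This summary does not include reference code, so the block below is only a minimal sketch of how an open-vocabulary observation-likelihood update could slot into Monte Carlo Localization. It assumes per-pixel vision-language features for the current image (`obs_feat_img`), a point-cloud map annotated with features in the same embedding space (`map_points`, `map_feats`), a pinhole intrinsic matrix `K`, and particles stored as 4x4 camera-from-world poses; these names and the simple mean-cosine likelihood are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def project_points(points_w, T_cam_world, K, img_size):
    """Project world-frame 3D map points into a hypothesized camera; return pixel
    coordinates and a mask of points in front of the camera and inside the image."""
    pts_h = np.hstack([points_w, np.ones((len(points_w), 1))])
    pts_c = (T_cam_world @ pts_h.T).T[:, :3]
    in_front = pts_c[:, 2] > 0.1
    uv = (K @ pts_c.T).T
    uv = uv[:, :2] / np.clip(uv[:, 2:3], 1e-6, None)
    h, w = img_size
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    return uv, in_front & inside

def observation_likelihood(obs_feat_img, map_points, map_feats, T_cam_world, K):
    """Score one particle: mean cosine similarity between the observed per-pixel
    open-vocabulary features and the map features that project onto those pixels."""
    h, w, _ = obs_feat_img.shape
    uv, visible = project_points(map_points, T_cam_world, K, (h, w))
    if not visible.any():
        return 1e-6
    u = uv[visible, 0].astype(int)
    v = uv[visible, 1].astype(int)
    obs = obs_feat_img[v, u]                       # (M, D) observed features
    ref = map_feats[visible]                       # (M, D) map features
    cos = np.sum(obs * ref, axis=1) / (
        np.linalg.norm(obs, axis=1) * np.linalg.norm(ref, axis=1) + 1e-8)
    return float(np.clip(cos.mean(), 1e-6, None))

def mcl_update(particles, weights, obs_feat_img, map_points, map_feats, K):
    """One measurement update of the particle filter: reweight, then resample."""
    for i, T in enumerate(particles):              # T: 4x4 camera-from-world pose
        weights[i] *= observation_likelihood(obs_feat_img, map_points, map_feats, T, K)
    weights /= weights.sum()
    idx = np.random.choice(len(particles), size=len(particles), p=weights)
    return [particles[i] for i in idx], np.full(len(particles), 1.0 / len(particles))
```

Because the image features and the map features live in a shared vision-language embedding space, the same similarity scoring applies whether the map was built from posed RGB-D images or from an aligned point cloud, which is the cross-modality property the key points describe.
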
Abstract

Robust robot localization is an important prerequisite for navigation, but it becomes challenging when the map and robot measurements are obtained from different sensors. Prior methods are often tailored to specific environments, relying on closed-set semantics or fine-tuned features. In this work, we extend Monte Carlo Localization with vision-language features, allowing OMCL to robustly compute the likelihood of visual observations given a camera pose and a 3D map created from posed RGB-D images or aligned point clouds. These open-vocabulary features enable us to associate observations and map elements from different modalities, and to natively initialize global localization through natural language descriptions of nearby objects. We evaluate our approach using Matterport3D and Replica for indoor scenes and demonstrate generalization on SemanticKITTI for outdoor scenes.

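As a companion to the sketch above, the following illustrates how global localization could be seeded from a natural-language hint (for example "next to the red couch"), as described in the abstract: embed the text, rank map points by feature similarity, and scatter particles around the best matches. The `encode_text` callable (e.g., a CLIP-style text encoder), the position-plus-yaw particle representation, and the parameter values are hypothetical choices for illustration, not the paper's interface.

```python
import numpy as np

def init_particles_from_text(description, encode_text, map_points, map_feats,
                             n_particles=1000, top_k=50, pos_noise=0.5):
    """Seed global localization from a language hint: embed the description,
    find the best-matching map points, and sample particles around them."""
    q = encode_text(description)                      # (D,) text embedding
    q = q / (np.linalg.norm(q) + 1e-8)
    f = map_feats / (np.linalg.norm(map_feats, axis=1, keepdims=True) + 1e-8)
    sim = f @ q                                       # cosine similarity per map point
    anchors = map_points[np.argsort(sim)[-top_k:]]    # best-matching map locations
    particles = []
    for _ in range(n_particles):
        c = anchors[np.random.randint(len(anchors))]
        xyz = c + np.random.normal(scale=pos_noise, size=3)
        yaw = np.random.uniform(-np.pi, np.pi)        # heading unknown: uniform yaw
        particles.append((xyz, yaw))
    weights = np.full(n_particles, 1.0 / n_particles)
    return particles, weights
```

From such an initialization, the usual predict/update/resample loop (as in the first sketch) would refine the pose estimate as new observations arrive.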