HELIOS: Hierarchical Exploration for Language-Grounded Interaction in Open Scenes

arXiv cs.RO / 3/30/2026


Key Points

  • The paper introduces HELIOS, a hierarchical scene representation and search objective for language-grounded mobile manipulation in novel, partially observed environments.
  • HELIOS combines 2D navigation maps (semantic and occupancy) with actively built 3D Gaussian object representations, fusing multi-layer observations while enforcing multi-view detection consistency via a Dirichlet distribution.
  • The planning problem is cast as search over the hierarchical representation, with an objective that balances exploration of frontiers and uncertain regions against the expected information gain from additional observations that improve the semantic consistency of object detections.
  • On the OVMM benchmark in the Habitat simulator, HELIOS achieves state-of-the-art performance, especially in large complex scenes with small target objects.
  • The method is also demonstrated in a real office setting using a Spot robot, leveraging pretrained VLMs and avoiding task-specific training for language-specified pick-and-place.
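The Dirichlet-based multi-view consistency idea in the key points can be illustrated with a minimal sketch: each object keeps a Dirichlet belief over semantic labels, and every new viewpoint's detection adds a pseudo-count. The class and method names (`ObjectBelief`, `consistency`) are illustrative, not the paper's API, and the mode-probability consistency score is one simple choice of proxy.

```python
import numpy as np

class ObjectBelief:
    """Per-object Dirichlet belief over semantic labels, updated from
    multi-view detections. Hypothetical sketch; names are illustrative."""

    def __init__(self, num_classes, prior=1.0):
        # Symmetric Dirichlet prior over the candidate class labels.
        self.alpha = np.full(num_classes, prior)

    def update(self, detected_class):
        # Each view's detection adds a pseudo-count for its label.
        self.alpha[detected_class] += 1.0

    def mean(self):
        # Posterior mean: the expected categorical distribution over labels.
        return self.alpha / self.alpha.sum()

    def consistency(self):
        # One simple consistency proxy: probability mass of the modal label.
        return self.mean().max()

belief = ObjectBelief(num_classes=3)
for cls in [0, 0, 1, 0]:  # detections of the same object from four viewpoints
    belief.update(cls)
print(belief.consistency())
```

An object seen as class 0 in three of four views ends up with a sharply peaked belief, while contradictory detections keep the Dirichlet diffuse, flagging the object as worth re-observing.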

Abstract

Language-specified mobile manipulation tasks in novel environments simultaneously face the challenges of interacting with a scene that is only partially observed, grounding semantic information from language instructions to that partially observed scene, and actively updating knowledge of the scene with new observations. To address these challenges, we propose HELIOS, a hierarchical scene representation and associated search objective. We construct 2D maps containing the semantic and occupancy information relevant for navigation while simultaneously and actively constructing 3D Gaussian representations of task-relevant objects. We fuse observations across this multi-layered representation while explicitly modeling the multi-view consistency of each object's detections using the Dirichlet distribution. Planning is formulated as a search problem over our hierarchical representation. We formulate an objective that jointly considers (i) exploration of unobserved or uncertain regions of the environment and (ii) information gathering from additional observations of candidate objects. This objective integrates frontier-based exploration with the expected information gain associated with improving the semantic consistency of object detections. We evaluate HELIOS on the OVMM benchmark in the Habitat simulator, a pick-and-place benchmark in which perception is challenging due to large and complex scenes with comparatively small target objects. HELIOS achieves state-of-the-art results on OVMM. We also demonstrate HELIOS performing language-specified pick and place in a real-world office environment on a Spot robot. Our method leverages pretrained VLMs to achieve these results in simulation and the real world without any task-specific training.
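The joint objective described above can be sketched as a scoring function over candidate goals: a frontier term rewards exploring unobserved area, and an information term rewards re-observing objects whose label belief is still uncertain (high entropy). The additive form, the weights, and the function name `goal_score` are assumptions for illustration only; the paper's actual objective may combine these terms differently.

```python
import numpy as np

def entropy(p):
    # Shannon entropy of a categorical distribution (nats).
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def goal_score(frontier_area, object_belief_mean,
               w_explore=1.0, w_info=1.0):
    """Hedged sketch of a joint exploration/information objective.

    frontier_area: size of the unobserved frontier reachable from the goal.
    object_belief_mean: posterior-mean label distribution of a candidate
        object visible from the goal, or None if no object is in view.
    The weights and additive form are illustrative, not the paper's.
    """
    info_term = entropy(object_belief_mean) if object_belief_mean is not None else 0.0
    return w_explore * frontier_area + w_info * info_term
```

Under this sketch, a goal facing a large frontier and an ambiguously detected object (near-uniform belief) outscores both a pure-exploration goal and a goal at an object whose label is already confidently settled.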