FreeOcc: Training-Free Embodied Open-Vocabulary Occupancy Prediction

arXiv cs.RO / 5/1/2026


Key Points

  • FreeOcc is a training-free framework for open-vocabulary occupancy prediction that avoids 3D annotations, pose ground truth, and any learning stage.
  • The method builds a globally consistent 3D occupancy map from monocular or RGB-D sequences using a four-stage pipeline: SLAM-based pose/geometry estimation, geometrically consistent Gaussian updates, off-the-shelf vision-language semantic association, and probabilistic Gaussian-to-voxel projection.
  • Although training-free and pose-agnostic (it needs no ground-truth camera poses), FreeOcc reports more than 2× higher IoU and mIoU on EmbodiedOcc-ScanNet than prior self-supervised methods.
  • The work also introduces ReplicaOcc, a benchmark for indoor open-vocabulary occupancy prediction, and reports zero-shot transfer to novel environments that substantially outperforms both supervised and self-supervised baselines.
  • The project leverages open-vocabulary semantics from existing vision-language models to connect language concepts with 3D occupancy outputs without additional training.
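
To make the semantic-association idea concrete, here is a minimal sketch of how language concepts could be attached to 3D primitives without training. It assumes each Gaussian already carries a feature vector taken from a vision-language model and that each class name has a text embedding in the same space; labels are then chosen by cosine similarity. The function name `assign_labels` and the toy 3-D embeddings are hypothetical stand-ins, and the source does not confirm that FreeOcc uses this exact matching rule.

```python
import numpy as np

def assign_labels(gaussian_feats, text_feats, class_names):
    """Label each Gaussian with the class whose text embedding is most similar."""
    # L2-normalise both sets so the dot product equals cosine similarity.
    g = gaussian_feats / np.linalg.norm(gaussian_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    sim = g @ t.T                  # shape: (num_gaussians, num_classes)
    best = sim.argmax(axis=1)      # index of the most similar class per Gaussian
    return [class_names[i] for i in best], sim

# Toy embeddings standing in for real vision-language features (assumption).
class_names = ["chair", "table", "wall"]
text_feats = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0],
                       [0.0, 0.0, 1.0]])
gaussian_feats = np.array([[0.9, 0.1, 0.0],
                           [0.1, 0.2, 0.8],
                           [0.2, 2.0, 0.1]])

labels, sim = assign_labels(gaussian_feats, text_feats, class_names)
print(labels)  # ['chair', 'wall', 'table']
```

Because the class list is only a set of text embeddings, the vocabulary can be changed at query time by embedding new class names, which is what makes this kind of association open-vocabulary.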

Abstract

Existing learning-based occupancy prediction methods rely on large-scale 3D annotations and generalize poorly across environments. We present FreeOcc, a training-free framework for open-vocabulary occupancy prediction from monocular or RGB-D sequences. Unlike prior approaches that require voxel-level supervision and ground-truth camera poses, FreeOcc operates without 3D annotations, pose ground truth, or any learning stage. FreeOcc incrementally builds a globally consistent occupancy map via a four-layer pipeline: a SLAM backbone estimates poses and sparse geometry; a geometrically consistent Gaussian update constructs dense 3D Gaussian maps; open-vocabulary semantics from off-the-shelf vision-language models are associated with Gaussian primitives; and a probabilistic Gaussian-to-occupancy projection produces dense voxel occupancy. Despite being entirely training-free and pose-agnostic, FreeOcc achieves over 2× improvements in IoU and mIoU on EmbodiedOcc-ScanNet compared to prior self-supervised methods. We further introduce ReplicaOcc, a benchmark for indoor open-vocabulary occupancy prediction, and show that FreeOcc transfers zero-shot to novel environments, substantially outperforming both supervised and self-supervised baselines. Project page: https://the-masses.github.io/freeocc-web/.
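
The last pipeline step, projecting Gaussians to voxel occupancy, can be illustrated with a small sketch. It assumes one common probabilistic formulation: Gaussian i occupies a point x with probability a_i · exp(−½ d_i(x)²), where a_i is its opacity and d_i the Mahalanobis distance, and independent Gaussians combine as p(x) = 1 − ∏_i (1 − p_i(x)). The abstract does not give FreeOcc's actual projection rule, so the formula, the function name `gaussians_to_occupancy`, and the 0.5 threshold are all assumptions made for illustration.

```python
import numpy as np

def gaussians_to_occupancy(means, covs, opacities, voxel_centers):
    """Occupancy probability per voxel from a set of 3D Gaussians.

    Assumed model (not confirmed by the source): each Gaussian contributes
    opacity * exp(-0.5 * squared Mahalanobis distance), and contributions
    are combined as independent events.
    """
    diff = voxel_centers[:, None, :] - means[None, :, :]          # (V, G, 3)
    inv_covs = np.linalg.inv(covs)                                # (G, 3, 3)
    maha2 = np.einsum("vgi,gij,vgj->vg", diff, inv_covs, diff)    # (V, G)
    p_each = opacities[None, :] * np.exp(-0.5 * maha2)            # (V, G)
    # A voxel is free only if every Gaussian leaves it free.
    return 1.0 - np.prod(1.0 - p_each, axis=1)                    # (V,)

# Two small isotropic Gaussians (toy values, assumption).
means = np.array([[0.0, 0.0, 0.0],
                  [1.0, 0.0, 0.0]])
covs = np.stack([0.01 * np.eye(3), 0.01 * np.eye(3)])
opacities = np.array([0.9, 0.8])

# Three voxel centres: one on each Gaussian, one far from both.
voxel_centers = np.array([[0.0, 0.0, 0.0],
                          [1.0, 0.0, 0.0],
                          [0.5, 0.5, 0.5]])

occ_prob = gaussians_to_occupancy(means, covs, opacities, voxel_centers)
occupied = occ_prob > 0.5   # hypothetical decision threshold
print(occupied.tolist())    # [True, True, False]
```

Combining contributions as independent events keeps every voxel probability in [0, 1] however many Gaussians overlap, which a plain sum of densities would not guarantee.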