World2Minecraft: Occupancy-Driven Simulated Scenes Construction

arXiv cs.CV / 5/1/2026

Key Points

  • The paper introduces World2Minecraft, a framework that transforms real-world scenes into editable Minecraft environments using 3D semantic occupancy prediction to enable high-fidelity embodied AI simulations.
  • The reconstructed Minecraft scenes can directly support downstream tasks such as Vision-Language Navigation (VLN), making simulation more reusable for embodied intelligence workflows.
  • The authors find that reconstruction quality is highly dependent on the accuracy and generalization of occupancy prediction models, which are currently constrained by limited data.
  • They propose a low-cost, automated, and scalable data acquisition pipeline to generate customized occupancy datasets and release MinecraftOcc, a large dataset with 100,165 images across 156 richly detailed indoor scenes.
  • Experiments indicate that MinecraftOcc both complements existing datasets and presents a substantial new benchmark challenge for current state-of-the-art methods, advancing occupancy prediction research and embodied AI tooling.
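
To make the occupancy-to-Minecraft idea concrete, the following is a minimal sketch of how a semantic voxel grid might be turned into block placements. It is not the authors' implementation: the class IDs, the `LABEL_TO_BLOCK` mapping, the `grid[x][y][z]` layout, and the function name `occupancy_to_commands` are all assumptions made for illustration. Only the `setblock` command syntax and the block identifiers come from Minecraft Java Edition itself.

```python
from typing import Dict, List, Tuple

# Hypothetical label-to-block mapping. The class IDs and block choices are
# illustrative assumptions, not taken from the paper.
LABEL_TO_BLOCK: Dict[int, str] = {
    1: "minecraft:stone",       # e.g. wall
    2: "minecraft:oak_planks",  # e.g. floor
    3: "minecraft:white_wool",  # e.g. furniture
}
EMPTY = 0  # assumed label for free space

# Assumed layout: grid[x][y][z], with y as the vertical axis (as in Minecraft).
Grid = List[List[List[int]]]


def occupancy_to_commands(
    grid: Grid, origin: Tuple[int, int, int] = (0, 64, 0)
) -> List[str]:
    """Emit one setblock command per occupied voxel that has a known label."""
    ox, oy, oz = origin
    commands: List[str] = []
    for x, plane in enumerate(grid):
        for y, row in enumerate(plane):
            for z, label in enumerate(row):
                block = LABEL_TO_BLOCK.get(label)
                # Skip free space and labels without a mapped block.
                if label == EMPTY or block is None:
                    continue
                commands.append(
                    f"setblock {ox + x} {oy + y} {oz + z} {block}"
                )
    return commands


if __name__ == "__main__":
    # Toy 2x2x1 grid: two floor voxels, one wall voxel, one empty voxel.
    demo: Grid = [
        [[2], [1]],
        [[2], [0]],
    ]
    for cmd in occupancy_to_commands(demo):
        print(cmd)
```

Each voxel becomes one command, which keeps the resulting scene editable block by block. A real pipeline would likely batch placements or write world files directly instead of issuing commands one at a time.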

Abstract

Embodied intelligence requires high-fidelity simulation environments to support perception and decision-making, yet existing platforms often suffer from data contamination and limited flexibility. To mitigate this, we propose World2Minecraft to convert real-world scenes into structured Minecraft environments based on 3D semantic occupancy prediction. In the reconstructed scenes, we can effortlessly perform downstream tasks such as Vision-Language Navigation (VLN). However, we observe that reconstruction quality heavily depends on accurate occupancy prediction, which remains limited by data scarcity and poor generalization in existing models. We introduce a low-cost, automated, and scalable data acquisition pipeline for creating customized occupancy datasets, and demonstrate its effectiveness through MinecraftOcc, a large-scale dataset featuring 100,165 images from 156 richly detailed indoor scenes. Extensive experiments show that our dataset provides a critical complement to existing datasets and poses a significant challenge to current SOTA methods. These findings contribute to improving occupancy prediction and highlight the value of World2Minecraft in providing a customizable and editable platform for personalized embodied AI research. Project page: https://world2minecraft.github.io/.
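
Since reconstruction quality is said to hinge on occupancy prediction accuracy, it can help to see how such accuracy is commonly scored. The sketch below computes per-class intersection-over-union (IoU) and its mean between a predicted and a ground-truth voxel grid. The summary above does not state which metric the paper reports, so treating mIoU as the measure is an assumption, as are the function names and the nested-list grid layout.

```python
from typing import Dict, List

Grid = List[List[List[int]]]  # assumed nested-list voxel grid of class labels


def flatten(grid: Grid) -> List[int]:
    """Flatten a 3D nested list of labels into a single list."""
    return [label for plane in grid for row in plane for label in row]


def per_class_iou(pred: Grid, target: Grid, num_classes: int) -> Dict[int, float]:
    """Return IoU for every class that appears in pred or target."""
    p, t = flatten(pred), flatten(target)
    if len(p) != len(t):
        raise ValueError("pred and target must contain the same number of voxels")
    ious: Dict[int, float] = {}
    for c in range(num_classes):
        inter = sum(1 for a, b in zip(p, t) if a == c and b == c)
        union = sum(1 for a, b in zip(p, t) if a == c or b == c)
        if union:  # classes absent from both grids are skipped
            ious[c] = inter / union
    return ious


def mean_iou(ious: Dict[int, float]) -> float:
    """Average the per-class IoU values."""
    if not ious:
        raise ValueError("no classes present in either grid")
    return sum(ious.values()) / len(ious)


if __name__ == "__main__":
    target: Grid = [[[0, 1], [1, 2]]]
    pred: Grid = [[[0, 1], [2, 2]]]  # one voxel of class 1 mislabeled as class 2
    scores = per_class_iou(pred, target, num_classes=3)
    print(scores, mean_iou(scores))
```

In the toy example a single mislabeled voxel lowers the IoU of both the class it belonged to and the class it was assigned to, which is one reason small prediction errors can visibly degrade a block-level reconstruction.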