From Seeing to Simulating: Generative High-Fidelity Simulation with Digital Cousins for Generalizable Robot Learning and Evaluation

arXiv cs.RO / 4/20/2026


Key Points

  • The paper proposes a generative framework that maps real-world panoramic scenes into high-fidelity simulation environments to reduce the need for expensive real-data collection.
  • It generates diverse “digital cousin” scenes through semantic and geometric editing, leveraging realistic assets and high-quality physics engines to support interactive robot manipulation tasks.
  • The approach also uses multi-room stitching to build consistent large-scale environments for long-horizon navigation in complex layouts.
  • Experiments report a strong sim-to-real correlation and show that scaling up synthetic scene generation improves generalization to unseen scene and object variations for robot learning and evaluation.

Abstract

Learning robust robot policies in real-world environments requires diverse data augmentation, yet scaling real-world data collection is costly because it requires acquiring physical assets and reconfiguring environments. Transferring real-world scenes into simulation has therefore become a practical form of augmentation for efficient learning and evaluation. We present a generative framework that establishes a real-to-sim mapping from real-world panoramas to high-fidelity simulation scenes and further synthesizes diverse cousin scenes via semantic and geometric editing. Combined with high-quality physics engines and realistic assets, the generated scenes support interactive manipulation tasks. Additionally, we incorporate multi-room stitching to construct consistent large-scale environments for long-horizon navigation across complex layouts. Experiments show a strong sim-to-real correlation, which validates our platform's fidelity, and show that extensively scaling up data generation leads to significantly better generalization to unseen scene and object variations. Together, these results demonstrate the effectiveness of Digital Cousins for generalizable robot learning and evaluation.
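
The abstract does not say how sim-to-real correlation is measured. One common way to quantify such a claim is the Pearson correlation between the success rates of the same policies evaluated in simulation and on the real robot. The sketch below illustrates that idea only: the function name `pearson_r` and the success-rate values are hypothetical placeholders, not the paper's metric or results.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical placeholder values: success rates of four policies,
# each evaluated once in simulation and once on a real robot.
sim_success = [0.20, 0.45, 0.60, 0.85]
real_success = [0.15, 0.40, 0.65, 0.80]

r = pearson_r(sim_success, real_success)
print(f"sim-to-real Pearson r = {r:.3f}")
```

A value of r close to 1 would indicate that policies which perform better in simulation also tend to perform better in the real world, which is the property that makes a simulator useful as an evaluation proxy.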