Reconstruction by Generation: 3D Multi-Object Scene Reconstruction from Sparse Observations

arXiv cs.CV / 5/1/2026


Key Points

  • The paper proposes RecGen, a generative framework that jointly estimates the shapes and parts of multiple objects, together with their poses, from one or more RGB-D images under occlusion and partial visibility.
  • RecGen is built on compositional synthetic scene generation and strong 3D shape priors, enabling it to generalize across different object categories and real-world environments.
  • Experiments show state-of-the-art results on challenging datasets with heavy occlusions, including robustness to symmetric objects, articulated parts, and complex geometry and textures.
  • RecGen outperforms the prior best method, SAM3D, while using nearly 80% fewer training meshes, improving geometric shape quality by 30.1%, texture reconstruction by 9.1%, and pose estimation by 33.9%.

Abstract

Accurately reconstructing complete multi-object scenes from sparse observations remains a core challenge in computer vision and a key step toward scalable, reliable simulation for robotics. In this work, we introduce RecGen, a generative framework for probabilistic joint estimation of object and part shapes, together with their poses, under occlusion and partial visibility from one or more RGB-D images. By leveraging compositional synthetic scene generation and strong 3D shape priors, RecGen generalizes across diverse object types and real-world environments. RecGen achieves state-of-the-art performance on complex, heavily occluded datasets, robustly handling severe occlusions, symmetric objects, object parts, and intricate geometry and texture. Despite using nearly 80% fewer training meshes than the previous state of the art, SAM3D, RecGen outperforms it by 30.1% in geometric shape quality, 9.1% in texture reconstruction, and 33.9% in pose estimation.
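To make the task concrete, the sketch below shows one plausible way to represent the output of a joint shape-and-pose estimator like the one described: each detected object carries a reconstructed shape and a 6-DoF pose as a 4x4 rigid transform. All names (`ObjectHypothesis`, `make_pose`, `shape_id`) are illustrative assumptions, not APIs from the paper.

```python
# Hypothetical output representation for joint shape-and-pose estimation.
# Names and structure are assumptions for illustration, not RecGen's API.
from dataclasses import dataclass, field

import numpy as np


@dataclass
class ObjectHypothesis:
    shape_id: str        # identifier of the reconstructed mesh
    pose: np.ndarray     # 4x4 homogeneous object-to-world transform
    parts: list = field(default_factory=list)  # optional articulated parts


def make_pose(rotation: np.ndarray, translation: np.ndarray) -> np.ndarray:
    """Assemble a 4x4 rigid transform from a 3x3 rotation and a 3-vector."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = translation
    return T


# A toy "scene": one object translated 0.5 m along x with identity rotation.
mug = ObjectHypothesis("mug_01", make_pose(np.eye(3), np.array([0.5, 0.0, 0.0])))
scene = [mug]

# Transform a point from the object frame into the world frame.
p_obj = np.array([0.0, 0.0, 0.1, 1.0])  # homogeneous coordinates
p_world = mug.pose @ p_obj              # world coordinates (0.5, 0.0, 0.1)
```

A scene reconstruction is then simply a list of such hypotheses, which is also the form a robotics simulator would consume: one mesh plus one rigid transform per object.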