Reconstruction by Generation: 3D Multi-Object Scene Reconstruction from Sparse Observations

arXiv cs.CV / 5/1/2026


Key Points

  • The paper proposes RecGen, a generative framework that jointly estimates the shapes and parts of multiple objects, together with their poses, from one or more RGB-D images under occlusion and partial visibility.
  • RecGen is built on compositional synthetic scene generation and strong 3D shape priors, enabling it to generalize across different object categories and real-world environments.
  • Experiments show state-of-the-art results on challenging datasets with heavy occlusions, including robustness to symmetric objects, articulated parts, and complex geometry and textures.
  • RecGen outperforms the prior best method, SAM3D, while using nearly 80% fewer training meshes, improving geometric shape quality by 30.1%, texture reconstruction by 9.1%, and pose estimation by 33.9%.

Abstract

Accurately reconstructing complete multi-object scenes from sparse observations remains a core challenge in computer vision and a key step toward scalable, reliable simulation for robotics. In this work, we introduce RecGen, a generative framework for probabilistic joint estimation of object and part shapes, together with their poses, under occlusion and partial visibility from one or more RGB-D images. By leveraging compositional synthetic scene generation and strong 3D shape priors, RecGen generalizes across diverse object types and real-world environments. RecGen achieves state-of-the-art performance on complex, heavily occluded datasets, robustly handling severe occlusions, symmetric objects, object parts, and intricate geometry and texture. Despite using nearly 80% fewer training meshes than the previous state of the art, SAM3D, RecGen outperforms it by 30.1% in geometric shape quality, 9.1% in texture reconstruction, and 33.9% in pose estimation.
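To make the task concrete, the sketch below shows one plausible way to represent the output of a joint shape-and-pose estimator like the one described: each detected object carries a reconstructed shape and a 6-DoF pose as a 4x4 rigid transform. All names (`ObjectHypothesis`, `make_pose`, `shape_id`) are illustrative assumptions, not APIs from the paper.

```python
# Hypothetical output representation for joint shape-and-pose estimation.
# Names and structure are assumptions for illustration, not RecGen's API.
from dataclasses import dataclass, field

import numpy as np


@dataclass
class ObjectHypothesis:
    shape_id: str        # identifier of the reconstructed mesh
    pose: np.ndarray     # 4x4 homogeneous object-to-world transform
    parts: list = field(default_factory=list)  # optional articulated parts


def make_pose(rotation: np.ndarray, translation: np.ndarray) -> np.ndarray:
    """Assemble a 4x4 rigid transform from a 3x3 rotation and a 3-vector."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = translation
    return T


# A toy "scene": one object translated 0.5 m along x with identity rotation.
mug = ObjectHypothesis("mug_01", make_pose(np.eye(3), np.array([0.5, 0.0, 0.0])))
scene = [mug]

# Transform a point from the object frame into the world frame.
p_obj = np.array([0.0, 0.0, 0.1, 1.0])  # homogeneous coordinates
p_world = mug.pose @ p_obj              # world coordinates (0.5, 0.0, 0.1)
```

A scene reconstruction is then simply a list of such hypotheses, which is also the form a robotics simulator would consume: one mesh plus one rigid transform per object.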