
Interact3D: Compositional 3D Generation of Interactive Objects

arXiv cs.CV / 3/18/2026

📰 News · Models & Research

Key Points

  • Interact3D introduces a framework for generating physically plausible interacting 3D compositional objects from a single image, addressing occlusions and maintaining object-object spatial relationships.
  • The approach uses a two-stage composition pipeline: global-to-local registration to anchor the primary object, and differentiable SDF-based optimization to integrate additional assets while penalizing intersections (see the sketches after this list).
  • A closed-loop refinement strategy leverages a Vision-Language Model to analyze multi-view renderings, generate corrective prompts, and guide an image editing module to self-correct.
  • Experiments show improved geometric fidelity, fewer collisions, and more consistent spatial relationships in the resulting compositions compared with prior 3D compositional generation methods.
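The first stage, anchoring the primary object through global-to-local geometric alignment, is not spelled out in this summary. A common way to realize such a registration step is a rigid Kabsch/Procrustes solve over corresponding 3D points; the NumPy sketch below is a generic illustration under that assumption, not the paper's actual procedure, and how correspondences between the generated asset and the guidance scene are obtained is taken as given.

```python
# Illustrative rigid registration (Kabsch/Procrustes), assuming the paper's
# "global-to-local geometric alignment" reduces to solving for a rotation R
# and translation t between corresponding points of the generated primary
# asset and the unified 3D guidance scene.
import numpy as np

def kabsch_rigid_align(src: np.ndarray, dst: np.ndarray):
    """Find R, t minimizing ||R @ src_i + t - dst_i|| over paired points.

    src, dst: (N, 3) arrays of corresponding 3D points.
    Returns (R, t) with R a proper rotation (det R = +1).
    """
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)      # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t
```

A coarse solve over global correspondences (e.g., bounding-box corners) followed by a refined solve over local surface samples would match the "global-to-local" framing, though the paper's exact scheme may differ.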
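The second stage is described more concretely: a differentiable SDF-based optimization that penalizes intersections. A minimal PyTorch sketch of that idea follows, with an analytic sphere standing in for the primary object's SDF and a translation-only pose as the free variable; both simplifications, along with the loss weights, are assumptions for illustration, not the paper's formulation.

```python
# Minimal sketch of differentiable SDF-based composition: surface samples of
# a secondary asset are pushed out of the primary object's signed distance
# field (negative SDF = inside) while staying near a target placement. The
# sphere SDF, loss weights, and translation-only pose are illustrative.
import torch

def sphere_sdf(pts: torch.Tensor) -> torch.Tensor:
    # Stand-in for the primary object's SDF: a unit sphere at the origin.
    return torch.linalg.norm(pts, dim=-1) - 1.0

def compose_without_intersection(points, primary_sdf, target, steps=200, lr=1e-2):
    """Optimize a translation for the secondary asset's surface samples.

    points: (N, 3) surface samples of the secondary asset.
    primary_sdf: callable mapping (N, 3) points to signed distances.
    target: (3,) intended placement from the guidance scene.
    """
    t = torch.zeros(3, requires_grad=True)
    opt = torch.optim.Adam([t], lr=lr)
    for _ in range(steps):
        moved = points + t
        # Penetration penalty: only points inside the primary object
        # (negative SDF) contribute, and deeper penetration costs more.
        collision = torch.relu(-primary_sdf(moved)).mean()
        # Placement term keeps the asset near its intended position (OOR).
        placement = torch.linalg.norm(moved.mean(dim=0) - target)
        loss = collision + 0.1 * placement
        opt.zero_grad()
        loss.backward()
        opt.step()
    return t.detach()

# Example: nudge an asset that starts overlapping the unit sphere.
pts = torch.randn(512, 3) * 0.1 + torch.tensor([0.8, 0.0, 0.0])
offset = compose_without_intersection(pts, sphere_sdf, target=torch.tensor([1.3, 0.0, 0.0]))
```

A full implementation would presumably optimize rotation as well and query an SDF computed from the generated primary mesh rather than an analytic shape.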

Abstract

Recent breakthroughs in 3D generation have enabled the synthesis of high-fidelity individual assets. However, generating 3D compositional objects from single images, particularly under occlusions, remains challenging. Existing methods often degrade geometric details in hidden regions and fail to preserve the underlying object-object spatial relationships (OOR). We present Interact3D, a novel framework designed to generate physically plausible interacting 3D compositional objects. Our approach first leverages advanced generative priors to curate high-quality individual assets together with a unified 3D guidance scene. To physically compose these assets, we then introduce a robust two-stage composition pipeline. Based on the 3D guidance scene, the primary object is anchored through precise global-to-local geometric alignment (registration), while subsequent geometries are integrated using a differentiable Signed Distance Field (SDF)-based optimization that explicitly penalizes geometry intersections. To reduce challenging collisions, we further deploy a closed-loop, agentic refinement strategy: a Vision-Language Model (VLM) autonomously analyzes multi-view renderings of the composed scene, formulates targeted corrective prompts, and guides an image editing module to iteratively self-correct the generation pipeline. Extensive experiments demonstrate that Interact3D produces collision-aware compositions with improved geometric fidelity and consistent spatial relationships.
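For the closed-loop refinement, the abstract gives the control flow but not the interfaces. The skeleton below captures only that loop structure; every helper (render_views, vlm_critique, edit_reference_image, regenerate_assets) is a hypothetical placeholder for the paper's rendering, VLM, and image-editing components, not a real API.

```python
# Structural sketch of the agentic refinement loop from the abstract. All
# four helpers are hypothetical placeholders; only the render -> critique ->
# edit -> regenerate control flow comes from the paper's description.
from dataclasses import dataclass

@dataclass
class Critique:
    is_plausible: bool   # VLM verdict on the composed scene
    prompt: str          # corrective prompt for the image editing module

def refine_composition(scene, reference_image, max_rounds=3):
    for _ in range(max_rounds):
        views = render_views(scene)        # multi-view renderings
        critique = vlm_critique(views)     # VLM flags collisions / OOR errors
        if critique.is_plausible:
            break
        # The VLM-formulated prompt drives the image editing module, and the
        # corrected image is fed back through the generation pipeline.
        reference_image = edit_reference_image(reference_image, critique.prompt)
        scene = regenerate_assets(reference_image)
    return scene
```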