GenAssets: Generating in-the-wild 3D Assets in Latent Space

arXiv cs.CV / 4/28/2026

📰 News · Developer Stack & Infrastructure · Models & Research

Key Points

  • GenAssets introduces a 3D latent diffusion approach to generate high-quality 3D traffic-participant assets from in-the-wild LiDAR and camera data, aiming to improve realism and diversity for multi-sensor autonomy simulation.
  • The paper argues that prior neural-rendering reconstruction methods are slow and often render well only from viewpoints near the original observations, while diffusion-based methods struggle on in-the-wild driving scenes where actors are sparsely observed and partially occluded.
  • A core contribution is a “reconstruct-then-generate” pipeline: occlusion-aware neural rendering builds a high-quality latent space, and a diffusion model then generates assets within that latent space (a minimal sketch follows this list).
  • The authors report that their method outperforms existing reconstruction and generation baselines, enabling more diverse and scalable content creation for simulation workflows.
  • The work is positioned as an enabler for safer end-to-end development of autonomous systems by generating complete geometry and appearance for simulated actors.
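
To make the two-stage idea concrete, here is a minimal, hedged sketch of a "reconstruct-then-generate" setup in PyTorch. The module names (ObjectEncoder, LatentDenoiser), shapes, and the cosine noise schedule are illustrative assumptions on our part, not the authors' actual architecture; the sketch only shows the pattern of first producing per-object latents via reconstruction, then training a diffusion model directly on those latents.

```python
# Hypothetical sketch: stage 1 encodes observed actors into latents,
# stage 2 trains a diffusion denoiser on those latents.
import torch
import torch.nn as nn

class ObjectEncoder(nn.Module):
    """Stage 1 (hypothetical): map pooled LiDAR + camera features for one actor to a latent code."""
    def __init__(self, in_dim=512, latent_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(), nn.Linear(512, latent_dim))

    def forward(self, obs_feats):              # obs_feats: (B, in_dim)
        return self.net(obs_feats)             # (B, latent_dim) object latent

class LatentDenoiser(nn.Module):
    """Stage 2 (hypothetical): diffusion denoiser operating directly on object latents."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim + 1, 512), nn.ReLU(), nn.Linear(512, latent_dim))

    def forward(self, z_t, t):                 # z_t: (B, latent_dim), t: (B,) in [0, 1]
        return self.net(torch.cat([z_t, t[:, None]], dim=-1))   # predicted noise

def diffusion_training_step(denoiser, z0):
    """One DDPM-style training step on latents z0 produced by the (frozen) encoder."""
    t = torch.rand(z0.shape[0])                        # random diffusion timesteps
    alpha_bar = torch.cos(t * torch.pi / 2) ** 2       # simple cosine schedule (illustrative)
    noise = torch.randn_like(z0)
    z_t = alpha_bar.sqrt()[:, None] * z0 + (1 - alpha_bar).sqrt()[:, None] * noise
    pred = denoiser(z_t, t)
    return torch.mean((pred - noise) ** 2)             # epsilon-prediction loss

# Usage: reconstruct latents for observed actors, then fit the diffusion model to them.
encoder, denoiser = ObjectEncoder(), LatentDenoiser()
obs = torch.randn(8, 512)                              # placeholder sensor features for 8 actors
with torch.no_grad():
    z0 = encoder(obs)                                  # stage 1: reconstruction latents
loss = diffusion_training_step(denoiser, z0)           # stage 2: diffusion in latent space
loss.backward()
```

The design point this illustrates is that the diffusion model never sees raw pixels or points: it only has to model the distribution of compact latents that the reconstruction stage has already made complete and view-consistent.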

Abstract

High-quality 3D assets for traffic participants are critical for multi-sensor simulation, which is essential for the safe end-to-end development of autonomy. Building assets from in-the-wild data is key for diversity and realism, but existing neural-rendering based reconstruction methods are slow and generate assets that render well only from viewpoints close to the original observations, limiting their usefulness in simulation. Recent diffusion-based generative models build complete and diverse assets, but perform poorly on in-the-wild driving scenes, where observed actors are captured under sparse and limited fields of view, and are partially occluded. In this work, we propose a 3D latent diffusion model that learns on in-the-wild LiDAR and camera data captured by a sensor platform and generates high-quality 3D assets with complete geometry and appearance. Key to our method is a "reconstruct-then-generate" approach that first leverages occlusion-aware neural rendering trained over multiple scenes to build a high-quality latent space for objects, and then trains a diffusion model that operates on the latent space. We show our method outperforms existing reconstruction and generation based methods, unlocking diverse and scalable content creation for simulation.
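
The abstract's mention of occlusion-aware neural rendering suggests supervision that is applied only where an actor is actually visible to the sensors. The following is a minimal sketch of one plausible way to express that idea; the mask source (e.g. LiDAR visibility or instance segmentation) and the L1 photometric form are our assumptions, and the paper's exact formulation may differ.

```python
# Hypothetical occlusion-aware photometric loss: compare rendered and observed
# pixels only where a visibility mask marks the actor as unoccluded.
import torch

def occlusion_aware_photometric_loss(rendered, observed, visibility_mask):
    """L1 loss restricted to visible pixels (mask == 1)."""
    # rendered, observed: (B, 3, H, W); visibility_mask: (B, 1, H, W) in {0, 1}
    diff = (rendered - observed).abs() * visibility_mask
    return diff.sum() / visibility_mask.sum().clamp(min=1.0)

rendered = torch.rand(2, 3, 64, 64, requires_grad=True)   # placeholder rendered views
observed = torch.rand(2, 3, 64, 64)                        # placeholder camera crops
mask = (torch.rand(2, 1, 64, 64) > 0.3).float()            # placeholder visibility mask
loss = occlusion_aware_photometric_loss(rendered, observed, mask)
loss.backward()
```

Masking the loss this way keeps occluded regions from pulling the reconstruction toward the occluder's appearance, which is what allows the latent space to encode complete objects even when each individual observation is partial.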