GenAssets: Generating in-the-wild 3D Assets in Latent Space

arXiv cs.CV / 4/28/2026

📰 News · Developer Stack & Infrastructure · Models & Research

Key Points

  • GenAssets introduces a 3D latent diffusion approach to generate high-quality 3D traffic-participant assets from in-the-wild LiDAR and camera data, aiming to improve realism and diversity for multi-sensor autonomy simulation.
  • The paper argues that prior neural-rendering reconstruction methods are slow and often render well only from viewpoints near the original observations, while diffusion-based methods struggle on in-the-wild driving scenes where actors are sparsely observed and partially occluded.
  • A core contribution is a “reconstruct-then-generate” pipeline: occlusion-aware neural rendering builds a high-quality latent space, and a diffusion model then generates assets within that latent space (a minimal sketch follows this list).
  • The authors report that their method outperforms existing reconstruction and generation baselines, enabling more diverse and scalable content creation for simulation workflows.
  • The work is positioned as an enabler for safer end-to-end development of autonomous systems by generating complete geometry and appearance for simulated actors.
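
To make the two-stage idea concrete, here is a minimal, hedged sketch of a "reconstruct-then-generate" setup in PyTorch. The module names (ObjectEncoder, LatentDenoiser), shapes, and the cosine noise schedule are illustrative assumptions on our part, not the authors' actual architecture; the sketch only shows the pattern of first producing per-object latents via reconstruction, then training a diffusion model directly on those latents.

```python
# Hypothetical sketch: stage 1 encodes observed actors into latents,
# stage 2 trains a diffusion denoiser on those latents.
import torch
import torch.nn as nn

class ObjectEncoder(nn.Module):
    """Stage 1 (hypothetical): map pooled LiDAR + camera features for one actor to a latent code."""
    def __init__(self, in_dim=512, latent_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(), nn.Linear(512, latent_dim))

    def forward(self, obs_feats):              # obs_feats: (B, in_dim)
        return self.net(obs_feats)             # (B, latent_dim) object latent

class LatentDenoiser(nn.Module):
    """Stage 2 (hypothetical): diffusion denoiser operating directly on object latents."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim + 1, 512), nn.ReLU(), nn.Linear(512, latent_dim))

    def forward(self, z_t, t):                 # z_t: (B, latent_dim), t: (B,) in [0, 1]
        return self.net(torch.cat([z_t, t[:, None]], dim=-1))   # predicted noise

def diffusion_training_step(denoiser, z0):
    """One DDPM-style training step on latents z0 produced by the (frozen) encoder."""
    t = torch.rand(z0.shape[0])                        # random diffusion timesteps
    alpha_bar = torch.cos(t * torch.pi / 2) ** 2       # simple cosine schedule (illustrative)
    noise = torch.randn_like(z0)
    z_t = alpha_bar.sqrt()[:, None] * z0 + (1 - alpha_bar).sqrt()[:, None] * noise
    pred = denoiser(z_t, t)
    return torch.mean((pred - noise) ** 2)             # epsilon-prediction loss

# Usage: reconstruct latents for observed actors, then fit the diffusion model to them.
encoder, denoiser = ObjectEncoder(), LatentDenoiser()
obs = torch.randn(8, 512)                              # placeholder sensor features for 8 actors
with torch.no_grad():
    z0 = encoder(obs)                                  # stage 1: reconstruction latents
loss = diffusion_training_step(denoiser, z0)           # stage 2: diffusion in latent space
loss.backward()
```

The design point this illustrates is that the diffusion model never sees raw pixels or points: it only has to model the distribution of compact latents that the reconstruction stage has already made complete and view-consistent.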

Abstract

High-quality 3D assets for traffic participants are critical for multi-sensor simulation, which is essential for the safe end-to-end development of autonomy. Building assets from in-the-wild data is key for diversity and realism, but existing neural-rendering based reconstruction methods are slow and generate assets that render well only from viewpoints close to the original observations, limiting their usefulness in simulation. Recent diffusion-based generative models build complete and diverse assets, but perform poorly on in-the-wild driving scenes, where observed actors are captured under sparse and limited fields of view, and are partially occluded. In this work, we propose a 3D latent diffusion model that learns on in-the-wild LiDAR and camera data captured by a sensor platform and generates high-quality 3D assets with complete geometry and appearance. Key to our method is a "reconstruct-then-generate" approach that first leverages occlusion-aware neural rendering trained over multiple scenes to build a high-quality latent space for objects, and then trains a diffusion model that operates on the latent space. We show our method outperforms existing reconstruction and generation based methods, unlocking diverse and scalable content creation for simulation.
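
The abstract's mention of occlusion-aware neural rendering suggests supervision that is applied only where an actor is actually visible to the sensors. The following is a minimal sketch of one plausible way to express that idea; the mask source (e.g. LiDAR visibility or instance segmentation) and the L1 photometric form are our assumptions, and the paper's exact formulation may differ.

```python
# Hypothetical occlusion-aware photometric loss: compare rendered and observed
# pixels only where a visibility mask marks the actor as unoccluded.
import torch

def occlusion_aware_photometric_loss(rendered, observed, visibility_mask):
    """L1 loss restricted to visible pixels (mask == 1)."""
    # rendered, observed: (B, 3, H, W); visibility_mask: (B, 1, H, W) in {0, 1}
    diff = (rendered - observed).abs() * visibility_mask
    return diff.sum() / visibility_mask.sum().clamp(min=1.0)

rendered = torch.rand(2, 3, 64, 64, requires_grad=True)   # placeholder rendered views
observed = torch.rand(2, 3, 64, 64)                        # placeholder camera crops
mask = (torch.rand(2, 1, 64, 64) > 0.3).float()            # placeholder visibility mask
loss = occlusion_aware_photometric_loss(rendered, observed, mask)
loss.backward()
```

Masking the loss this way keeps occluded regions from pulling the reconstruction toward the occluder's appearance, which is what allows the latent space to encode complete objects even when each individual observation is partial.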