OneWorld: Taming Scene Generation with 3D Unified Representation Autoencoder

arXiv cs.CV / 3/18/2026

Key Points

OneWorld proposes diffusion directly in a coherent 3D representation space using a 3D Unified Representation Autoencoder (3D-URAE) built on pretrained 3D foundation models.
It introduces a token-level Cross-View-Correspondence (CVC) consistency loss that explicitly enforces structural alignment across views, improving cross-view stability.
It adds Manifold-Drift Forcing (MDF) to reduce train-inference exposure bias and shape a robust 3D manifold by mixing drifted and original representations.
Experiments show OneWorld produces high-quality 3D scenes with superior cross-view consistency over state-of-the-art 2D-based methods, with code to be released on GitHub.
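
To make the idea of a token-level cross-view consistency loss concrete, here is a minimal sketch in plain Python. The summary does not give the actual CVC formulation, so everything here is an assumption for illustration: the function names (`cosine_similarity`, `cvc_consistency_loss`), the use of cosine distance, and the `matches` list of corresponding token indices, which is supplied by hand. It shows only the general pattern of penalizing disagreement between tokens that should depict the same 3D content in two views.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def cvc_consistency_loss(tokens_a, tokens_b, matches):
    """Mean cosine distance over matched token pairs.

    tokens_a, tokens_b: lists of feature vectors, one per token, for two views.
    matches: list of (i, j) pairs meaning token i in view A corresponds to
             token j in view B (hypothetical input, not from the paper).
    """
    if not matches:
        return 0.0
    total = sum(1.0 - cosine_similarity(tokens_a[i], tokens_b[j]) for i, j in matches)
    return total / len(matches)

# Two views with two tokens each; token 0 matches token 1 and vice versa.
view_a = [[1.0, 0.0], [0.0, 2.0]]
view_b = [[0.0, 1.0], [3.0, 0.0]]
aligned_loss = cvc_consistency_loss(view_a, view_b, [(0, 1), (1, 0)])
misaligned_loss = cvc_consistency_loss(view_a, view_b, [(0, 0), (1, 1)])
```

In this toy setup the loss is low when matched tokens point in the same feature direction and high when they do not. How OneWorld obtains correspondences and which distance it uses are not stated in the summary.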

Abstract

Existing diffusion-based 3D scene generation methods primarily operate in 2D image/video latent spaces, which makes maintaining cross-view appearance and geometric consistency inherently challenging. To bridge this gap, we present OneWorld, a framework that performs diffusion directly within a coherent 3D representation space. Central to our approach is the 3D Unified Representation Autoencoder (3D-URAE); it leverages pretrained 3D foundation models and augments their geometry-centric nature by injecting appearance and distilling semantics into a unified 3D latent space. Furthermore, we introduce token-level Cross-View-Correspondence (CVC) consistency loss to explicitly enforce structural alignment across views, and propose Manifold-Drift Forcing (MDF) to mitigate train-inference exposure bias and shape a robust 3D manifold by mixing drifted and original representations. Comprehensive experiments demonstrate that OneWorld generates high-quality 3D scenes with superior cross-view consistency compared to state-of-the-art 2D-based methods. Our code will be available at https://github.com/SensenGao/OneWorld.
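
The abstract describes Manifold-Drift Forcing only as "mixing drifted and original representations", so the following is a rough sketch of that one phrase, not the authors' procedure. The names `simulate_drift` and `mix_drifted`, the per-token mixing rule, the `mix_prob` parameter, and the use of Gaussian noise as a stand-in for drift are all assumptions. In practice the drifted latents would presumably come from the model's own imperfect predictions, which is what exposure bias refers to.

```python
import random

def simulate_drift(original, noise_std, rng):
    """Stand-in for drift: add Gaussian noise to every latent value (assumption)."""
    return [[x + rng.gauss(0.0, noise_std) for x in token] for token in original]

def mix_drifted(original, drifted, mix_prob, rng):
    """For each token, use the drifted vector with probability mix_prob,
    otherwise keep the original vector."""
    if len(original) != len(drifted):
        raise ValueError("original and drifted must have the same number of tokens")
    mixed = []
    for orig_token, drift_token in zip(original, drifted):
        use_drifted = rng.random() < mix_prob
        mixed.append(list(drift_token if use_drifted else orig_token))
    return mixed

rng = random.Random(0)  # fixed seed so the sketch is reproducible
original = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
drifted = simulate_drift(original, noise_std=0.1, rng=rng)

# Training inputs would then be a blend of clean and drifted tokens.
mixed = mix_drifted(original, drifted, mix_prob=0.5, rng=rng)
```

With `mix_prob=0.0` the output equals the original latents and with `mix_prob=1.0` it equals the drifted ones. Intermediate values expose the model during training to the kind of off-manifold inputs it may encounter at inference.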
