Extend3D: Town-Scale 3D Generation

arXiv cs.CV / 4/1/2026


Key Points

  • The paper introduces Extend3D, a training-free pipeline that generates town-scale 3D scenes from a single image using an object-centric 3D generative model as the core engine.
  • It extends the model’s latent space in the x–y directions and uses overlapping latent-space patches so the object-centric generator can be applied across large scenes and coupled over time steps.
  • To ensure correct spatial alignment for patch-wise image conditioning, the method initializes with a point-cloud prior from a monocular depth estimator and refines occluded regions iteratively using SDEdit.
  • The authors propose “under-noising” by treating incompleteness in 3D structure as noise during refinement to enable 3D completion, and they also optimize the extended latent to improve sub-scene consistency.
  • Human preference studies and quantitative evaluations indicate that Extend3D outperforms prior approaches in geometric structure and texture fidelity.
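The patch-wise coupling described above follows a general pattern used for extending fixed-size diffusion models to larger canvases: at each denoising step, the fixed-size denoiser is run on overlapping windows of the extended latent, and the per-window predictions are averaged where windows overlap. The sketch below illustrates that pattern only; `denoise_patch`, the patch size, and the stride are hypothetical placeholders, not the paper's actual model or settings.

```python
import numpy as np

def coupled_patch_step(latent, denoise_patch, patch=64, stride=48):
    """One denoising step over an extended latent of shape (H, W, C).

    Runs a fixed-size denoiser on overlapping patches and averages
    the per-patch predictions in overlap regions, coupling the
    patches into a single consistent update (illustrative sketch).
    """
    H, W, C = latent.shape
    out = np.zeros_like(latent)
    weight = np.zeros((H, W, 1))
    # Window origins; append a final window so the borders are covered.
    ys = list(range(0, H - patch + 1, stride))
    xs = list(range(0, W - patch + 1, stride))
    if ys[-1] != H - patch:
        ys.append(H - patch)
    if xs[-1] != W - patch:
        xs.append(W - patch)
    for y in ys:
        for x in xs:
            pred = denoise_patch(latent[y:y + patch, x:x + patch])
            out[y:y + patch, x:x + patch] += pred
            weight[y:y + patch, x:x + patch] += 1.0
    # Averaging in overlaps keeps neighboring patches consistent.
    return out / weight
```

In a full pipeline this function would be called once per diffusion time step, so the overlap-averaging repeatedly reconciles neighboring sub-scenes as the latent is denoised.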

Abstract

In this paper, we propose Extend3D, a training-free pipeline for 3D scene generation from a single image, built upon an object-centric 3D generative model. To overcome the fixed-size latent spaces of object-centric models, which cannot represent wide scenes, we extend the latent space in the x and y directions. We then divide the extended latent space into overlapping patches, apply the object-centric 3D generative model to each patch, and couple the patches at every denoising time step. Since patch-wise 3D generation with image conditioning requires strict spatial alignment between image and latent patches, we initialize the scene with a point-cloud prior from a monocular depth estimator and iteratively refine occluded regions through SDEdit. We find that treating the incompleteness of the 3D structure as noise during refinement enables 3D completion, a mechanism we term under-noising. Furthermore, to address the sub-optimality of object-centric models for sub-scene generation, we optimize the extended latent during denoising so that the denoising trajectories remain consistent with the sub-scene dynamics. To this end, we introduce 3D-aware optimization objectives that improve geometric structure and texture fidelity. Human preference studies and quantitative experiments demonstrate that our method outperforms prior approaches.
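The point-cloud prior mentioned in the abstract comes from back-projecting a monocular depth map into 3D. The paper does not specify the camera model or intrinsics; the sketch below assumes a standard pinhole model with hypothetical intrinsics `fx`, `fy`, `cx`, `cy`, which in practice would come from the depth estimator or calibration.

```python
import numpy as np

def unproject_depth(depth, fx, fy, cx, cy):
    """Back-project a depth map of shape (H, W) into an (H*W, 3)
    point cloud under a pinhole camera model (illustrative sketch).

    For pixel (u, v) with depth z:
        x = (u - cx) * z / fx,  y = (v - cy) * z / fy
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)
```

Such a cloud can seed the extended latent so that each latent patch is spatially aligned with its conditioning image patch before refinement begins.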
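The SDEdit-style refinement can be summarized as: re-noise an existing sample to an intermediate diffusion step and denoise back, so that coarse structure is kept while missing detail is regenerated. "Under-noising" then corresponds to choosing a lower noise level than the schedule would normally require, treating the structural incompleteness already present in the latent as part of the noise budget. The sketch below is a generic DDPM-style illustration with a black-box `denoiser`; the step index `t0` and the `alphas_bar` schedule are assumed, not the paper's settings.

```python
import numpy as np

def sdedit_refine(latent, denoiser, alphas_bar, t0):
    """SDEdit-style refinement of an existing latent (illustrative sketch).

    Forward: add noise up to intermediate step t0 with the DDPM
    closed form x_t = sqrt(a_t) * x0 + sqrt(1 - a_t) * eps.
    Reverse: iterate a black-box denoiser from t0 back to 0.
    Picking t0 below the nominal noise level of the input realizes
    "under-noising": incompleteness in the 3D structure is treated
    as noise to be removed, enabling completion.
    """
    a = alphas_bar[t0]
    eps = np.random.normal(size=latent.shape)
    x = np.sqrt(a) * latent + np.sqrt(1.0 - a) * eps
    for t in range(t0, 0, -1):
        x = denoiser(x, t)  # one reverse step of the generative model
    return x
```

Because `t0` is small relative to the full schedule, the refined output stays anchored to the initialized scene rather than being resampled from scratch.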