Co-generation of Layout and Shape from Text via Autoregressive 3D Diffusion
arXiv cs.CV / 4/21/2026
📰 News · Models & Research
Key Points
- The paper introduces a sequential text-to-scene generation paradigm that jointly produces both scene layout and object shape/appearance, addressing limitations of prior methods that generate only one aspect.
- It proposes a new 3D autoregressive diffusion model (3D-ARD+) that unifies autoregressive generation over multimodal tokens with diffusion-based generation of next-object 3D latents.
- For each next object, the model uses a two-stage process: first generating coarse 3D latents in the scene space conditioned on the text and the already synthesized scene, then generating finer object-space latents for detailed geometry and appearance.
- The method is trained on a large dataset of 230K indoor scenes paired with text instructions, and experiments with a 7B-parameter model show it can follow non-trivial spatial layouts and semantics from the text.
- Overall, the work targets interactive 3D scene creation by improving consistency between generated scenes and complex textual descriptions of spatial arrangement, shape, and appearance.
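The per-object loop described in the key points (autoregressive over objects, with a coarse scene-space latent followed by a fine object-space latent) can be illustrated with a toy control-flow sketch. None of the model internals are given in this summary, so everything here is hypothetical: the function names, tensor shapes, and the stand-in "diffusion sampler" are placeholders that only mirror the described two-stage, autoregressive structure.

```python
import numpy as np

rng = np.random.default_rng(0)

def diffusion_sample(shape, cond, steps=4):
    """Toy stand-in for a conditional diffusion sampler: start from
    Gaussian noise and relax toward a conditioning-dependent target."""
    x = rng.standard_normal(shape)
    target = np.full(shape, cond.mean())
    for t in range(steps):
        x = x + (target - x) / (steps - t)  # final step lands exactly on target
    return x

def generate_scene(text_emb, num_objects=3):
    """Hypothetical sketch of the described loop: for each next object,
    (1) sample a coarse scene-space latent conditioned on the text and the
    scene synthesized so far, then (2) sample a finer object-space latent
    conditioned on that coarse latent."""
    scene_tokens = [text_emb]
    objects = []
    for _ in range(num_objects):
        context = np.concatenate(scene_tokens)
        coarse = diffusion_sample((8,), context)   # layout/pose in scene space
        fine = diffusion_sample((32,), coarse)     # detailed geometry/appearance
        objects.append({"coarse": coarse, "fine": fine})
        scene_tokens.append(coarse)                # feed back autoregressively
    return objects

scene = generate_scene(np.ones(16), num_objects=3)
print(len(scene), scene[0]["fine"].shape)
```

The key structural point the sketch captures is that each object's coarse latent is appended to the conditioning context before the next object is generated, which is what makes the process autoregressive rather than generating all objects independently.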