WorldAgents: Can Foundation Image Models be Agents for 3D World Models?
arXiv cs.CV / 3/23/2026
Key Points
- The paper investigates whether 2D foundation image models inherently possess the 3D world-modeling capabilities needed for 3D world synthesis.
- It introduces an agentic architecture with three components: a director based on a vision-language model (VLM), a generator that produces new views, and a VLM-backed two-step verifier that curates frames across 2D and 3D spaces.
- Through extensive experiments on multiple state-of-the-art image generation models and Vision-Language Models, it shows that 2D models can encapsulate an understanding of 3D worlds and produce coherent 3D-consistent scenes.
- The proposed approach enables synthesizing expansive, realistic 3D worlds that can be explored via rendering novel views.
- This work suggests a practical framework for using 2D foundation models as agents that generate and refine 3D world representations, with potential implications for future 3D content creation pipelines.
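
The director, generator, and verifier roles described above could interact roughly as in the following Python sketch. It is not the paper's implementation. Every name in it (`Frame`, `World`, `build_world`, `direct`, `generate`, `verify_2d`, `verify_3d`, `target_frames`, `max_attempts`) is hypothetical, the models are passed in as plain callables, images are stood in for by strings, and splitting the two-step verifier into a 2D check followed by a 3D check is an assumption about what "across 2D and 3D spaces" means.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Frame:
    prompt: str  # instruction the director issued for this view
    image: str   # stand-in for image data (a plain string in this sketch)


@dataclass
class World:
    frames: List[Frame] = field(default_factory=list)  # accepted frames so far


def build_world(
    direct: Callable[[World], str],             # hypothetical director: picks the next view
    generate: Callable[[str, World], str],      # hypothetical generator: renders a candidate
    verify_2d: Callable[[Frame], bool],         # assumed step 1: is the image itself acceptable?
    verify_3d: Callable[[Frame, World], bool],  # assumed step 2: is it consistent with the world?
    target_frames: int,
    max_attempts: int,
) -> World:
    """Run a director -> generator -> verifier loop until enough frames pass."""
    world = World()
    attempts = 0
    while len(world.frames) < target_frames and attempts < max_attempts:
        attempts += 1
        prompt = direct(world)
        candidate = Frame(prompt, generate(prompt, world))
        # Only frames that pass both checks are kept; rejected ones are dropped.
        if verify_2d(candidate) and verify_3d(candidate, world):
            world.frames.append(candidate)
    return world


# Toy stand-ins so the loop can be run without any real model.
def toy_direct(world: World) -> str:
    return f"view {len(world.frames)}"


def toy_generate(prompt: str, world: World) -> str:
    return f"img:{prompt}"


def toy_verify_2d(frame: Frame) -> bool:
    return frame.image.startswith("img:")


def toy_verify_3d(frame: Frame, world: World) -> bool:
    # Reject a candidate that duplicates an already accepted frame.
    return all(f.image != frame.image for f in world.frames)


if __name__ == "__main__":
    result = build_world(toy_direct, toy_generate, toy_verify_2d, toy_verify_3d,
                         target_frames=3, max_attempts=10)
    print([f.prompt for f in result.frames])
```

The `max_attempts` cap is a design choice of this sketch rather than something the summary states: it keeps the loop from running forever when the verifier rejects every candidate.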