How Far Are Vision-Language Models from Constructing the Real World? A Benchmark for Physical Generative Reasoning
arXiv cs.AI / 3/27/2026
Key Points
- The paper argues that current vision-language model benchmarks overemphasize visually plausible outputs and do not adequately test whether models understand the procedural and physical dependencies needed for real-world construction.
- It introduces DreamHouse, a new benchmark for “physical generative reasoning” where models must satisfy geometric, structural, constructability, and code-compliance constraints simultaneously.
- DreamHouse is grounded in residential timber-frame construction, leveraging codified engineering standards and objective verification tied to construction-document standards (LOD 350).
- The benchmark includes over 26,000 curated structures across 13 architectural styles and provides a deterministic 10-test structural validation framework.
- Unlike static leaderboards, DreamHouse supports iterative, agentic interaction with intermediate build states and feedback, revealing capability gaps in state-of-the-art VLMs that existing benchmarks miss.
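The iterative loop described above — a model proposes a build, deterministic validators check it, and failures are fed back for revision — can be sketched as follows. This is a minimal illustration only: the class names, the two constraint checks, and the numeric limits are invented for the example and are not DreamHouse's actual API or rules.

```python
from dataclasses import dataclass

@dataclass
class BuildState:
    """Hypothetical intermediate build state: beams as (span_m, depth_mm)."""
    beams: list
    wall_height_m: float

def check_span_limits(state):
    # Illustrative rule: no beam span may exceed 6.0 m.
    bad = [i for i, (span, _) in enumerate(state.beams) if span > 6.0]
    return ("span_limit", not bad, f"beams over limit: {bad}")

def check_wall_height(state):
    # Illustrative rule: wall height capped at 3.0 m.
    ok = state.wall_height_m <= 3.0
    return ("wall_height", ok, f"height={state.wall_height_m}")

VALIDATORS = [check_span_limits, check_wall_height]

def validate(state):
    """Run every deterministic check; return overall pass plus feedback."""
    results = [v(state) for v in VALIDATORS]
    feedback = [(name, msg) for name, ok, msg in results if not ok]
    return all(ok for _, ok, _ in results), feedback

def agent_loop(state, revise, max_rounds=5):
    """Validate, hand failure feedback to a revision policy, and repeat."""
    for _ in range(max_rounds):
        ok, feedback = validate(state)
        if ok:
            return state, True
        state = revise(state, feedback)
    return state, False
```

A revision policy here would normally be the VLM under test; for a smoke test it can be any function that maps (state, feedback) to a new state. The key property the benchmark relies on is that `validate` is deterministic, so the same build state always yields the same verdict and feedback.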