FlowScene: Style-Consistent Indoor Scene Generation with Multimodal Graph Rectified Flow

arXiv cs.CV / 3/23/2026

Key Points

  • FlowScene proposes a tri-branch generative model conditioned on multimodal graphs that jointly generates scene layouts, object shapes, and textures.
  • At its core is a tightly coupled rectified flow model that exchanges object information during generation, enabling collaborative reasoning across the object graph.
  • The approach enforces scene-level style coherence across structure and appearance, enabling fine-grained control over objects' geometry, textures, and relations.
  • Experimental results show FlowScene outperforms language-conditioned and graph-conditioned baselines in realism, style consistency, and alignment with human preferences.
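  • It targets two gaps in prior work: language-driven retrieval methods overlook object-level control and scene-level style coherence, while existing graph-based methods struggle to produce high-fidelity textured results. FlowScene aims to deliver texture-rich indoor scenes suitable for industrial applications.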
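
To make the idea of a multimodal graph concrete, here is a minimal sketch of one possible data structure. The summary does not say which modalities FlowScene supports or how its graphs are encoded, so the class names (ObjectNode, SceneGraph), the optional text and image cues, and the relation strings below are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ObjectNode:
    """One object in the scene. All fields are illustrative assumptions."""
    category: str                      # e.g. "bed"
    text_prompt: Optional[str] = None  # hypothetical free-text appearance cue
    image_ref: Optional[str] = None    # hypothetical path to a reference image

    def modalities(self) -> List[str]:
        """List which conditioning cues this node actually carries."""
        present = ["category"]
        if self.text_prompt is not None:
            present.append("text")
        if self.image_ref is not None:
            present.append("image")
        return present


@dataclass
class SceneGraph:
    """Objects as nodes, pairwise relations as labeled edges."""
    nodes: List[ObjectNode] = field(default_factory=list)
    edges: List[Tuple[int, str, int]] = field(default_factory=list)

    def add_node(self, node: ObjectNode) -> int:
        self.nodes.append(node)
        return len(self.nodes) - 1

    def add_edge(self, src: int, relation: str, dst: int) -> None:
        # Reject edges that point at nodes that do not exist.
        for idx in (src, dst):
            if not 0 <= idx < len(self.nodes):
                raise IndexError(f"no node with index {idx}")
        self.edges.append((src, relation, dst))

    def neighbors(self, idx: int) -> List[int]:
        """Indices of nodes sharing an edge with idx, in either direction."""
        found = set()
        for src, _, dst in self.edges:
            if src == idx:
                found.add(dst)
            if dst == idx:
                found.add(src)
        return sorted(found)


# A three-object bedroom in which each node carries a different mix of cues.
scene = SceneGraph()
bed = scene.add_node(ObjectNode("bed", text_prompt="walnut wood, mid-century"))
stand = scene.add_node(ObjectNode("nightstand"))
lamp = scene.add_node(ObjectNode("lamp", image_ref="refs/lamp.png"))
scene.add_edge(stand, "left of", bed)
scene.add_edge(lamp, "standing on", stand)

for i, node in enumerate(scene.nodes):
    print(i, node.category, node.modalities(), "neighbors:", scene.neighbors(i))
```

In this sketch the nightstand has no appearance cue of its own. A graph-conditioned generator could, in principle, infer its look from its neighbors, which is the kind of style propagation the key points describe.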

Abstract

Scene generation has extensive industrial applications, demanding both high realism and precise control over geometry and appearance. Language-driven retrieval methods compose plausible scenes from a large object database, but overlook object-level control and often fail to enforce scene-level style coherence. Graph-based formulations offer higher controllability over objects and inform holistic consistency by explicitly modeling relations, yet existing methods struggle to produce high-fidelity textured results, thereby limiting their practical utility. We present FlowScene, a tri-branch scene generative model conditioned on multimodal graphs that collaboratively generates scene layouts, object shapes, and object textures. At its core lies a tightly coupled rectified flow model that exchanges object information during generation, enabling collaborative reasoning across the graph. This enables fine-grained control of objects' shapes, textures, and relations while enforcing scene-level style coherence across structure and appearance. Extensive experiments show that FlowScene outperforms both language-conditioned and graph-conditioned baselines in terms of generation realism, style consistency, and alignment with human preferences.
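
As a rough illustration of what "a rectified flow model that exchanges object information during generation" could look like, the toy sketch below combines the two standard ingredients of rectified flow (a straight-line path from noise to data, and Euler integration of a velocity field) with one round of neighbor averaging per step. It is not FlowScene's architecture: the hand-written toy_velocity function stands in for a learned network, and the per-object target vectors, the coupling weight, and every function name are assumptions made for this example.

```python
import random


def interpolate(x0, x1, t):
    """Rectified-flow training path: straight line from noise x0 to data x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def velocity_target(x0, x1):
    """Regression target for a velocity network along that line: x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def neighbor_mean(states, neighbors, i):
    """One message-passing round: average the current states of i's neighbors."""
    if not neighbors[i]:
        return list(states[i])
    count = len(neighbors[i])
    dim = len(states[i])
    return [sum(states[j][d] for j in neighbors[i]) / count for d in range(dim)]


def toy_velocity(states, t, neighbors, targets, coupling):
    """Hand-written stand-in for a learned velocity network (an assumption).

    Each node heads in a straight line toward an endpoint that blends its own
    target with a message from its neighbors, so information is exchanged at
    every integration step. With coupling = 0 this is the exact straight-line
    velocity (target - x) / (1 - t) toward the node's own target.
    """
    out = []
    for i, x in enumerate(states):
        msg = neighbor_mean(states, neighbors, i)
        endpoint = [(1.0 - coupling) * g + coupling * m
                    for g, m in zip(targets[i], msg)]
        out.append([(e - xi) / (1.0 - t) for e, xi in zip(endpoint, x)])
    return out


def sample(neighbors, targets, coupling=0.5, steps=10, seed=0):
    """Euler-integrate dx/dt = v(x, t) from Gaussian noise at t=0 toward t=1."""
    rng = random.Random(seed)
    states = [[rng.gauss(0.0, 1.0) for _ in g] for g in targets]
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays below 1, so the division in toy_velocity is safe
        v = toy_velocity(states, t, neighbors, targets, coupling)
        states = [[xi + dt * vi for xi, vi in zip(x, vx)]
                  for x, vx in zip(states, v)]
    return states


# Three objects in a chain (0 - 1 - 2), each with a 2-D toy "style" target.
neighbors = [[1], [0, 2], [1]]
targets = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]

independent = sample(neighbors, targets, coupling=0.0)
coupled = sample(neighbors, targets, coupling=0.5)
print("independent:", independent)
print("coupled:    ", coupled)
```

With coupling set to 0, each object is generated on its own and lands on its own target. With a positive coupling, every step pulls an object toward a blend of its target and its neighbors' current states. That blending is a crude stand-in for the scene-level style coherence the abstract attributes to information exchange across the graph.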