SpatialFusion: Endowing Unified Image Generation with Intrinsic 3D Geometric Awareness

arXiv cs.CV / April 30, 2026


Key Points

  • The paper introduces SpatialFusion, a framework that gives unified image generation models intrinsic 3D geometric awareness, yielding more spatially coherent outputs.
  • It augments an MLLM with a parallel spatial transformer via a Mixture-of-Transformers design, using shared self-attention so the system can infer metric-depth maps from semantic context.
  • Spatial geometric signals are then injected into the diffusion backbone through a dedicated depth adapter to provide explicit spatial constraints during generation.
  • Using a progressive two-stage training approach, SpatialFusion improves performance on spatially-aware benchmarks, reportedly outperforming strong baselines like GPT-4o.
  • The method is claimed to improve both text-to-image generation and image editing while keeping inference overhead negligible.
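To make the Mixture-of-Transformers idea concrete, here is a minimal NumPy sketch of the shared self-attention described above: the language stream and the spatial stream keep separate Q/K/V projections, but attention is computed jointly over the concatenated token sequence, so the spatial stream can read semantic context from the MLLM's tokens. All names, dimensions, and the single-head formulation are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def shared_self_attention(text_tokens, spatial_tokens, params):
    """MoT-style joint attention: modality-specific Q/K/V projections,
    one attention computation over the concatenated token sequence."""
    q = np.concatenate([text_tokens @ params["Wq_text"],
                        spatial_tokens @ params["Wq_spatial"]], axis=0)
    k = np.concatenate([text_tokens @ params["Wk_text"],
                        spatial_tokens @ params["Wk_spatial"]], axis=0)
    v = np.concatenate([text_tokens @ params["Wv_text"],
                        spatial_tokens @ params["Wv_spatial"]], axis=0)
    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d))   # every token attends to both modalities
    out = attn @ v
    n_text = text_tokens.shape[0]
    return out[:n_text], out[n_text:]       # split back into the two streams

d_model = 8
params = {name: rng.standard_normal((d_model, d_model)) * 0.1
          for name in ["Wq_text", "Wk_text", "Wv_text",
                       "Wq_spatial", "Wk_spatial", "Wv_spatial"]}
text = rng.standard_normal((5, d_model))      # hypothetical MLLM tokens
spatial = rng.standard_normal((3, d_model))   # hypothetical spatial-transformer tokens
text_out, spatial_out = shared_self_attention(text, spatial, params)
print(text_out.shape, spatial_out.shape)
```

The key design point this sketch captures is that the spatial transformer gets no parameters from the MLLM, only its attention context, which is how the paper describes depth being derived "from rich semantic contexts."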

Abstract

Recent unified image generation models have achieved remarkable success by employing MLLMs for semantic understanding and diffusion backbones for image generation. However, these models remain fundamentally limited in spatially-aware tasks due to a lack of intrinsic spatial understanding and the absence of explicit geometric guidance during generation. In this paper, we propose SpatialFusion, a novel framework that internalizes 3D geometric awareness into unified image generation models. Specifically, we first employ a Mixture-of-Transformers (MoT) architecture to augment the MLLM with a parallel spatial transformer to enhance 3D geometric modeling capability. By sharing self-attention with the MLLM, the spatial transformer learns to derive metric-depth maps of target images from rich semantic contexts. These explicit geometric scaffolds are then injected into the diffusion backbone through a specialized depth adapter, providing precise spatial constraints for spatially-coherent image generation. Through a progressive two-stage training strategy, SpatialFusion significantly enhances performance on spatially-aware benchmarks, notably outperforming leading models such as GPT-4o. Additionally, it achieves generalized performance gains across both text-to-image generation and image editing scenarios, all while maintaining negligible inference overhead.
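The abstract's "specialized depth adapter" resembles residual conditioning schemes such as ControlNet-style adapters. The sketch below, a guess at the general pattern rather than the paper's actual module, patchifies a predicted metric-depth map, projects patches to the diffusion backbone's hidden width, and adds them as a residual. The zero-initialized projection means the adapter is a no-op before training, a common trick for injecting new conditioning without disrupting a pretrained backbone.

```python
import numpy as np

rng = np.random.default_rng(1)

def depth_adapter(depth_map, hidden, W_proj, patch=4):
    """Patchify a depth map, project patches to the backbone's hidden
    dimension, and inject them as a residual conditioning signal."""
    h, w = depth_map.shape
    patches = (depth_map.reshape(h // patch, patch, w // patch, patch)
               .transpose(0, 2, 1, 3)
               .reshape(-1, patch * patch))   # (num_patches, patch*patch)
    depth_feat = patches @ W_proj             # project to hidden width
    return hidden + depth_feat                # residual injection

d_model = 16
depth = rng.random((8, 8))                    # hypothetical 8x8 metric-depth map
hidden = rng.standard_normal((4, d_model))    # 4 latent patches in the backbone
W_proj = np.zeros((patch_dim := 16, d_model)) # zero-init: adapter starts inert
out = depth_adapter(depth, hidden, W_proj)
print(np.allclose(out, hidden))               # True before any training
```

Because the residual starts at zero, generation quality on ordinary prompts is unchanged at initialization, which is consistent with the paper's claim of negligible inference overhead: the adapter adds only one small projection per injected layer.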