Envisioning global urban development with satellite imagery and generative AI

arXiv cs.CV / 3/31/2026

Opinion | Signals & Early Trends | Ideas & Deep Analysis | Models & Research

Key Points

  • The paper proposes a multimodal generative AI framework that uses prompts and geospatial controls to generate realistic, diverse urban satellite imagery for the world’s 500 largest metropolitan areas.
  • It supports scenario-based urban planning by letting users specify development goals and influence the resulting imagery through text and spatial constraints.
  • The method supports urban redevelopment by learning from, and conditioning on, the surrounding environment.
  • The approach learns latent representations of urban form that can transfer styles across cities via a global spatial network and improve downstream tasks such as carbon emission prediction.
  • Human expert evaluation indicates the generated images are comparable to real satellite images, suggesting potential for accelerated planning and cross-city learning.

Abstract

Urban development has been a defining force in human history, shaping cities for centuries. However, past studies have mostly treated such development as a predictive task, failing to reflect its generative nature. This study therefore designs a multimodal generative AI framework to envision sustainable urban development at a global scale. By integrating prompts and geospatial controls, our framework can generate high-fidelity, diverse, and realistic urban satellite imagery across the 500 largest metropolitan areas worldwide. It lets users specify urban development goals and creates new images aligned with those goals, while offering diverse scenarios whose appearance can be controlled with text prompts and geospatial constraints. It also facilitates urban redevelopment practices by learning from the surrounding environment. Beyond visual synthesis, we find that it encodes and interprets latent representations of urban form for global cross-city learning, successfully transferring styles of urban environments across a global spatial network. The latent representations can also enhance downstream prediction tasks such as carbon emission prediction. Further, human expert evaluation confirms that our generated urban images are comparable to real urban images. Overall, this study presents innovative approaches for accelerated urban planning and supports scenario-based planning processes for cities worldwide.