Satellite-to-Street: Synthesizing Post-Disaster Views from Satellite Imagery via Generative Vision Models

arXiv cs.CV · March 24, 2026


Key Points

  • The paper proposes “Satellite-to-Street View” synthesis to generate ground-level, post-disaster street perspectives from satellite imagery, aiming to improve situational awareness when ground data is unavailable.
  • It introduces two generative strategies: a VLM-guided method and a damage-sensitive Mixture-of-Experts (MoE) approach, designed to better align generated views with real disaster conditions.
  • The authors benchmark their methods against general-purpose baselines like Pix2Pix and ControlNet using a new Structure-Aware Evaluation Framework combining pixel quality, ResNet-based semantic consistency, and a VLM-as-a-Judge perceptual alignment step.
  • Experiments on 300 disaster scenarios show a realism–fidelity trade-off: diffusion/control methods can look realistic but may hallucinate structural details that are critical for reliable damage assessment.
  • Quantitatively, ControlNet attains the best semantic accuracy (0.71), while VLM-enhanced and MoE approaches tend to produce more texturally plausible outputs at the cost of semantic clarity.
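
The three-tier evaluation described above can be sketched as a simple scoring pipeline. This is a minimal illustration, not the authors' code: the function names, the 50 dB PSNR normalization, the tier weights, and the token-overlap stand-in for the VLM judge are all assumptions made for clarity.

```python
# Hedged sketch of a three-tier Structure-Aware Evaluation:
# (1) pixel quality, (2) semantic consistency, (3) VLM-as-a-Judge.
# All names, weights, and normalizations here are illustrative assumptions.
import numpy as np

def pixel_quality(gen: np.ndarray, ref: np.ndarray) -> float:
    """Tier 1: PSNR-style pixel fidelity, mapped to [0, 1]."""
    mse = np.mean((gen.astype(np.float64) - ref.astype(np.float64)) ** 2)
    if mse == 0:
        return 1.0
    psnr = 10 * np.log10(255.0 ** 2 / mse)
    return min(psnr / 50.0, 1.0)  # treat 50 dB as "perfect" (assumed cap)

def semantic_consistency(feat_gen: np.ndarray, feat_ref: np.ndarray) -> float:
    """Tier 2: cosine similarity between ResNet-style feature embeddings."""
    num = float(np.dot(feat_gen, feat_ref))
    den = float(np.linalg.norm(feat_gen) * np.linalg.norm(feat_ref)) + 1e-8
    return max(num / den, 0.0)

def vlm_judge(desc_gen: str, desc_ref: str) -> float:
    """Tier 3 stub: a real VLM would score perceptual/damage alignment of
    the two images; a token-overlap proxy on text descriptions stands in."""
    a, b = set(desc_gen.split()), set(desc_ref.split())
    return len(a & b) / max(len(a | b), 1)

def structure_aware_score(gen, ref, feat_gen, feat_ref, desc_gen, desc_ref,
                          weights=(0.3, 0.4, 0.3)):  # weights are assumed
    """Weighted combination of the three tiers into one score in [0, 1]."""
    tiers = (pixel_quality(gen, ref),
             semantic_consistency(feat_gen, feat_ref),
             vlm_judge(desc_gen, desc_ref))
    return sum(w * t for w, t in zip(weights, tiers))
```

A perfect reconstruction (identical image, identical features, identical description) scores ~1.0; the multi-tier design is what lets the framework penalize outputs that look realistic at the pixel level yet drift semantically.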

Abstract

In the immediate aftermath of natural disasters, rapid situational awareness is critical. Satellite observations are widely used to estimate damage extent, but they lack the ground-level perspective needed to characterize specific structural failures and impacts. Meanwhile, ground-level data (e.g., street-view imagery) remains largely inaccessible during time-sensitive events. This study investigates Satellite-to-Street View Synthesis to bridge this data gap. We introduce two generative strategies to synthesize post-disaster street views from satellite imagery: a Vision-Language Model (VLM)-guided approach and a damage-sensitive Mixture-of-Experts (MoE) method. We benchmark these against general-purpose baselines (Pix2Pix, ControlNet) using a proposed Structure-Aware Evaluation Framework. This multi-tier protocol integrates (1) pixel-level quality assessment, (2) ResNet-based semantic consistency verification, and (3) a novel VLM-as-a-Judge for perceptual alignment. Experiments on 300 disaster scenarios reveal a critical realism–fidelity trade-off: while diffusion-based approaches (e.g., ControlNet) achieve high perceptual realism, they often hallucinate structural details. Quantitative results show that standard ControlNet achieves the highest semantic accuracy (0.71), whereas VLM-enhanced and MoE models excel in textural plausibility but struggle with semantic clarity. This work establishes a baseline for trustworthy cross-view synthesis, emphasizing that visually realistic generations may still fail to preserve the critical structural information required for reliable disaster assessment.
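
To make the damage-sensitive MoE idea concrete, the routing mechanism can be sketched as a gated mixture over per-damage-type experts. The paper does not publish this code; the expert names ("flood", "collapse", "fire"), the linear experts, and the gating network below are hypothetical stand-ins that illustrate only the routing pattern.

```python
# Minimal numpy sketch of damage-sensitive Mixture-of-Experts routing.
# Expert names, linear expert layers, and gate shapes are assumptions,
# not the authors' architecture.
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

class DamageMoE:
    """Route a satellite-feature vector through a gate to per-damage-type
    experts and return the gate-weighted mixture of their outputs."""

    def __init__(self, dim, experts=("flood", "collapse", "fire"), seed=0):
        rng = np.random.default_rng(seed)
        self.names = list(experts)
        # Gate: one scoring row per expert; experts: one linear map each.
        self.gate = rng.standard_normal((len(experts), dim)) * 0.1
        self.experts = {name: rng.standard_normal((dim, dim)) * 0.1
                        for name in experts}

    def __call__(self, feat: np.ndarray):
        weights = softmax(self.gate @ feat)  # gating distribution over experts
        outputs = np.stack([self.experts[n] @ feat for n in self.names])
        mixed = (weights[:, None] * outputs).sum(axis=0)
        return mixed, dict(zip(self.names, weights))
```

The intent of such a design is that the gate learns to up-weight the expert matching the observed damage type, so flood scenes and structural collapses are rendered by specialized pathways rather than one generic decoder.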