UAVReason: A Unified, Large-Scale Benchmark for Multimodal Aerial Scene Reasoning and Generation

arXiv cs.CV / 4/8/2026


Key Points

  • The paper argues that general vision-language models struggle on high-altitude UAV imagery due to a pronounced domain shift: tiny, densely packed objects, repetitive textures, and ambiguous top-down orientations that break semantic grounding, spatial reasoning, and controllable generation.
  • It introduces UAVReason, a unified large-scale multimodal benchmark for nadir-view UAV scenarios built from a high-fidelity UAV simulation platform.
  • UAVReason aggregates more than 273K VQA samples across single-frame, temporal two-frame, and cross-modal generation settings, and evaluates 22 reasoning types spanning spatial and temporal axes.
  • The benchmark supports unified evaluation of reasoning and high-fidelity generation across multiple modalities (RGB, depth, segmentation), using metrics such as EM/F1 (VQA), mIoU (segmentation), and CLIP Score (generation).
  • The authors propose and validate a strong unified baseline trained with multi-task learning, showing that unified multi-task learning significantly improves UAV-native performance versus general-domain VLMs, with data/code/evaluation tools planned for public release.
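
The evaluation metrics named above can be illustrated with minimal reference implementations. This is only a sketch of the standard definitions (exact match for VQA, mean IoU for segmentation), not the authors' evaluation code, whose normalization rules may differ:

```python
def exact_match(pred_answers, gold_answers):
    """Fraction of VQA predictions matching the reference exactly
    (after lowercasing and whitespace normalization)."""
    norm = lambda s: " ".join(s.lower().split())
    hits = sum(norm(p) == norm(g) for p, g in zip(pred_answers, gold_answers))
    return hits / len(gold_answers)

def mean_iou(pred, gt, num_classes):
    """Mean Intersection-over-Union over flattened per-pixel label lists,
    averaged over classes that appear in prediction or ground truth."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0
```

For example, `mean_iou([0, 0, 1, 1], [0, 1, 1, 1], 2)` averages an IoU of 1/2 for class 0 and 2/3 for class 1, giving 7/12.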

Abstract

Vision-language models (VLMs) have demonstrated remarkable capability in ground-view visual understanding but often falter when deployed on high-altitude Unmanned Aerial Vehicles (UAVs). The failure largely stems from a pronounced domain shift, characterized by tiny, densely packed objects, repetitive textures, and ambiguous top-down orientations. These factors severely disrupt semantic grounding and hinder both spatial reasoning and controllable generation. To bridge this critical gap, we introduce UAVReason, the first unified large-scale multimodal benchmark dedicated to nadir-view UAV scenarios, derived from a high-fidelity UAV simulation platform. In contrast to existing UAV benchmarks, which are largely siloed and focus on single tasks such as object detection or segmentation, UAVReason consolidates over 273K Visual Question Answering (VQA) pairs, including 23.6K single frames with detailed captions, 68.2K two-frame temporal sequences, and 188.8K cross-modal generation samples. The benchmark probes 22 diverse reasoning types across spatial and temporal axes while simultaneously evaluating high-fidelity generation across RGB, depth, and segmentation modalities. We further establish a strong, unified baseline model via multi-task learning. Extensive experiments validate the efficacy of our unified approach across diverse metrics, such as EM/F1 for VQA, mIoU for segmentation, and CLIP Score for generation. These results expose the limitations of general-domain vision-language models and show that unified multi-task learning substantially improves UAV-native performance. All data, code, and evaluation tools will be publicly released to advance UAV multimodal research.