UAVReason: A Unified, Large-Scale Benchmark for Multimodal Aerial Scene Reasoning and Generation
arXiv cs.CV / 4/8/2026
Key Points
- The paper argues that general-purpose vision-language models struggle on high-altitude UAV imagery because of a major domain shift: tiny, densely packed objects, repetitive textures, and ambiguous top-down orientations undermine semantic grounding, spatial reasoning, and controllable generation.
- It introduces UAVReason, a unified large-scale multimodal benchmark for nadir-view UAV scenarios built from a high-fidelity UAV simulation platform.
- UAVReason aggregates more than 273K VQA-related samples across single-frame, temporal two-frame, and cross-modal generation settings, and evaluates 22 reasoning types spanning spatial and temporal axes.
- The benchmark supports unified evaluation of reasoning and high-fidelity generation across multiple modalities (RGB, depth, segmentation), using metrics such as EM/F1 (VQA), mIoU (segmentation), and CLIP Score (generation).
- The authors propose and validate a strong unified baseline, showing that multi-task training significantly improves UAV-native performance over general-domain VLMs. Data, code, and evaluation tools are planned for public release.
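
To make the metrics named above concrete, here is a minimal sketch of how EM, token-level F1, and mIoU are commonly computed. It is not the benchmark's evaluation code: the function names, the lowercase-and-whitespace normalization, and the flat label lists are assumptions made for illustration, and the paper's actual protocol may differ. CLIP Score is left out because it requires a pretrained CLIP model.

```python
from collections import Counter


def normalize(text):
    # Assumed normalization: lowercase and collapse whitespace.
    return " ".join(text.lower().split())


def exact_match(pred, gold):
    # 1.0 if the normalized strings are identical, else 0.0.
    return float(normalize(pred) == normalize(gold))


def token_f1(pred, gold):
    # Harmonic mean of token precision and recall (bag-of-words overlap).
    p_tokens = normalize(pred).split()
    g_tokens = normalize(gold).split()
    if not p_tokens or not g_tokens:
        return float(p_tokens == g_tokens)
    overlap = sum((Counter(p_tokens) & Counter(g_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p_tokens)
    recall = overlap / len(g_tokens)
    return 2 * precision * recall / (precision + recall)


def mean_iou(pred_labels, gold_labels, num_classes):
    # pred_labels / gold_labels: flat sequences of per-pixel class ids.
    # Classes absent from both prediction and ground truth are skipped.
    ious = []
    for c in range(num_classes):
        pairs = list(zip(pred_labels, gold_labels))
        inter = sum(1 for p, g in pairs if p == c and g == c)
        union = sum(1 for p, g in pairs if p == c or g == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    print(exact_match("Three cars", "three  cars"))
    print(token_f1("three red cars", "three cars"))
    print(mean_iou([0, 1, 1, 1], [0, 1, 0, 1], num_classes=2))
```

Skipping classes that appear in neither the prediction nor the ground truth is one common convention; some evaluators instead count such classes as a perfect or zero score, which changes the reported mean.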

