AirZoo: A Unified Large-Scale Dataset for Grounding Aerial Geometric 3D Vision

arXiv cs.CV / 4/30/2026


Key Points

  • The paper introduces AirZoo, a unified large-scale dataset and benchmark aimed at enabling data-driven aerial geometric 3D vision where high-fidelity training data is scarce.
  • AirZoo uses a scalable generation pipeline based on world-scale photogrammetric 3D meshes, allowing researchers to render outdoor scenes with customizable UAV trajectories and controllable weather and illumination.
  • The authors report the most extensive coverage of region types to date: 378 regions across 22 countries, spanning both structured urban areas and complex unstructured natural environments.
  • It provides rich geometric supervision per frame, including pixel-level metric depth and precisely geo-referenced 6-DoF poses, and supports three evaluation tracks: aerial image retrieval, cross-view matching, and multi-view 3D reconstruction.
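  • Experiments on public and newly collected real-world benchmarks indicate that AirZoo is a strong pre-training resource: fine-tuning on it yields substantial gains for state-of-the-art models (e.g., MegaLoc, RoMa, VGGT, and Depth Anything 3), which the authors describe as a new performance upper bound for aerial spatial intelligence.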

Abstract

Despite the rapid progress in data-driven 3D vision, aerial geometric 3D vision remains a formidable challenge due to the severe scarcity of large-scale, high-fidelity training data. Existing benchmarks, predominantly biased toward ground-level or object-centric views, do not account for complex viewpoint transformations and diverse environmental conditions in UAV-based sensing. To bridge this critical gap, we propose AirZoo, a unified large-scale dataset and benchmark for grounding aerial geometric 3D vision. AirZoo possesses three appealing properties: 1) Scalable Generation Pipeline: Leveraging freely available, world-scale photogrammetric 3D meshes, it renders vast outdoor environments with customizable UAV flight trajectories and configurable weather/illumination. 2) Comprehensive Scene Diversity: It provides the most extensive coverage of region types to date (spanning 378 regions across 22 countries), systematically encompassing both highly structured urban landscapes and complex unstructured natural environments. 3) Rich Geometric Annotations: Each frame provides synchronized, pixel-level metric depth and precise 6-DoF geo-referenced poses, essential for geometry-aware learning. Through three rigorous evaluation tracks -- aerial image retrieval, cross-view matching, and multi-view 3D reconstruction -- we demonstrate that AirZoo serves as a powerful pre-training engine. Extensive experiments on both public and newly collected real-world benchmarks reveal that fine-tuning on AirZoo yields substantial performance gains for SoTA models (e.g., MegaLoc, RoMa, VGGT, and Depth Anything 3), establishing a new performance upper bound for aerial spatial intelligence.
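
To make concrete why per-frame metric depth plus a 6-DoF pose matters for geometry-aware learning, here is a minimal sketch of lifting a depth map into world-frame 3D points. It is not taken from the paper. It assumes a pinhole camera, depth stored as z-distance along the optical axis in metres, and a 4x4 camera-to-world pose matrix; the function name `backproject_depth` and all sample values are hypothetical, and the dataset's actual file formats and conventions may differ.

```python
import numpy as np

def backproject_depth(depth, K, cam_to_world):
    """Lift a metric depth map of shape (H, W) to world-frame points (H, W, 3).

    Assumptions (not confirmed by the source):
      - depth holds z-depth along the optical axis, in metres
      - K is a 3x3 pinhole intrinsics matrix
      - cam_to_world is a 4x4 rigid transform (rotation + translation)
    """
    h, w = depth.shape
    # Pixel grid: u indexes columns, v indexes rows.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    # Rays with unit z-component in the camera frame: K^-1 [u, v, 1]^T.
    rays = pix @ np.linalg.inv(K).T
    # Scale each ray by its z-depth to get camera-frame points.
    pts_cam = rays * depth[..., None]
    # Apply the 6-DoF pose: X_world = R @ X_cam + t.
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    return pts_cam @ R.T + t

# Toy example: a 3x5 frame looking along +z from position (10, 20, 30).
K = np.array([[100.0, 0.0, 2.0],
              [0.0, 100.0, 1.0],
              [0.0, 0.0, 1.0]])
pose = np.eye(4)
pose[:3, 3] = [10.0, 20.0, 30.0]
depth = np.full((3, 5), 50.0)
points = backproject_depth(depth, K, pose)
```

Given two frames with such points and poses, pixel correspondences for cross-view matching or supervision for multi-view reconstruction can be derived by reprojecting one frame's points into the other's camera.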