AI Navigate

Sky2Ground: A Benchmark for Site Modeling under Varying Altitude

arXiv cs.CV / 3/17/2026

📰 News · Models & Research

Key Points

  • Sky2Ground is a three-view dataset for varying-altitude camera localization, correspondence learning, and reconstruction; it combines structured synthetic imagery with real-world images across 51 sites, each containing thousands of satellite, aerial, and ground images spanning wide altitude ranges and near-orthogonal viewing angles, enabling evaluation from global to local contexts.
  • The work highlights two key challenges: satellite imagery often degrades pose estimation performance under large altitude variations, and reconstruction suffers from sparse geometric overlap and the noise introduced by real imagery.
  • It benchmarks state-of-the-art pose estimation models (MASt3R, DUSt3R, Map Anything, VGGT) and introduces SkyNet, which uses a curriculum-based training strategy to improve cross-view consistency, achieving absolute gains of 9.6% on RRA@5 and 18.1% on RTA@5.
  • Sky2Ground and SkyNet provide a new testbed and baseline for large-scale, multi-altitude 3D perception and camera localization; code and models will be released publicly.
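The curriculum strategy mentioned above progressively incorporates more satellite views as training proceeds. A minimal sketch of one way such a schedule could look; the linear ramp and the helper name `satellite_view_budget` are illustrative assumptions, not details taken from the paper:

```python
def satellite_view_budget(epoch: int, total_epochs: int, max_sat_views: int) -> int:
    """Hypothetical linear curriculum: start training with no satellite
    views and ramp up to max_sat_views by the final epoch. The actual
    schedule used by SkyNet is not specified in this summary."""
    if total_epochs <= 1:
        return max_sat_views
    frac = epoch / (total_epochs - 1)   # grows 0.0 -> 1.0 over training
    return round(frac * max_sat_views)

# Example: over 5 epochs with up to 4 satellite views per sample,
# the budget grows 0, 1, 2, 3, 4.
```

The intuition is that the model first learns alignment between the geometrically closer aerial and ground views before being exposed to the near-orthogonal satellite viewpoint.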

Abstract

We introduce Sky2Ground, a three-view dataset designed for varying-altitude camera localization, correspondence learning, and reconstruction. The dataset combines structured synthetic imagery with real, in-the-wild images, providing both controlled multi-view geometry and realistic scene noise. Each of the 51 sites contains thousands of satellite, aerial, and ground images spanning wide altitude ranges and nearly orthogonal viewing angles, enabling rigorous evaluation across global-to-local contexts. We benchmark state-of-the-art pose estimation models, including MASt3R, DUSt3R, Map Anything, and VGGT, and observe that the use of satellite imagery often degrades performance, highlighting the challenges posed by large altitude variations. We also examine reconstruction methods, highlighting the difficulties introduced by sparse geometric overlap, varying perspectives, and real imagery, which often adds noise and reduces rendering quality. To address some of these challenges, we propose SkyNet, a model that enhances cross-view consistency when incorporating satellite imagery, using a curriculum-based training strategy to progressively add satellite views. SkyNet significantly strengthens multi-view alignment and outperforms existing methods by 9.6% on RRA@5 and 18.1% on RTA@5 in absolute terms. Sky2Ground and SkyNet together establish a comprehensive testbed and baseline for advancing large-scale, multi-altitude 3D perception and generalizable camera localization. Code and models will be released publicly for future research.
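RRA@5 and RTA@5 are standard pairwise camera-pose metrics: the fraction of image pairs whose relative rotation error, respectively relative translation-direction error, falls below 5 degrees. A minimal NumPy sketch under assumed conventions (world-to-camera rotations and camera centers as inputs; translation compared as a direction, up to scale, as is usual for these metrics):

```python
import numpy as np

def rotation_angle_deg(R):
    # Geodesic angle of a rotation matrix, in degrees.
    cos = (np.trace(R) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def angle_between_deg(u, v):
    # Angle between two direction vectors, in degrees.
    u = u / (np.linalg.norm(u) + 1e-12)
    v = v / (np.linalg.norm(v) + 1e-12)
    return np.degrees(np.arccos(np.clip(u @ v, -1.0, 1.0)))

def rra_rta(R_pred, c_pred, R_gt, c_gt, tau=5.0):
    """RRA@tau and RTA@tau over all camera pairs.

    R_*: lists of 3x3 world-to-camera rotations; c_*: camera centers.
    Translation is scored as the angle between predicted and ground-truth
    baseline directions, which is invariant to global scale."""
    n = len(R_gt)
    rot_ok = trans_ok = pairs = 0
    for i in range(n):
        for j in range(i + 1, n):
            # Error between predicted and ground-truth relative rotations.
            rot_err = rotation_angle_deg(
                (R_pred[j] @ R_pred[i].T) @ (R_gt[j] @ R_gt[i].T).T)
            trans_err = angle_between_deg(c_pred[j] - c_pred[i],
                                          c_gt[j] - c_gt[i])
            rot_ok += rot_err < tau
            trans_ok += trans_err < tau
            pairs += 1
    return rot_ok / pairs, trans_ok / pairs
```

The reported gains are absolute differences in these fractions, so a 9.6% gain on RRA@5 means 9.6 percentage points more image pairs localized within 5 degrees of rotation error.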