Satellite-Free Training for Drone-View Geo-Localization

arXiv cs.CV / 4/3/2026


Key Points

  • The paper addresses drone-view geo-localization (DVGL) in GPS-denied areas by retrieving the correct geotagged satellite tile from a reference gallery using UAV observations, but avoids dependence on satellite imagery during training.
  • It introduces a satellite-free training (SFT) framework for multi-view UAV sequences that first builds a geometry-normalized drone-side representation before cross-view retrieval.
  • The method performs dense 3D scene reconstruction from multi-view drone images using 3D Gaussian splatting, then projects the reconstructed geometry into pseudo-orthophotos via PCA-guided orthographic projection.
  • It refines the pseudo-orthophotos via lightweight geometry-guided inpainting to produce texture-complete views suitable for robust feature extraction.
  • For retrieval, the approach uses DINOv3 patch features from the generated orthophotos, trains a Fisher vector aggregation model using drone-only data, and achieves strong results on University-1652 and SUES-200, reducing the performance gap versus satellite-supervised methods.
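The PCA-guided orthographic projection step can be illustrated with a minimal sketch: given a reconstructed point cloud, PCA identifies the two dominant axes as the ground plane and the least-variance axis as the viewing direction, and points are rasterized top-down into an image grid. All function and parameter names here are hypothetical, and the "keep the highest point per cell" rule assumes the third principal axis roughly aligns with the vertical; the paper's actual rendering operates on 3D Gaussians rather than a raw point cloud.

```python
import numpy as np

def pca_orthophoto(points, colors, resolution=64):
    """Project a colored 3D point cloud into a top-down pseudo-orthophoto.

    The projection plane is chosen by PCA: the two dominant principal
    axes span the ground plane, and points are rasterized along the
    third (least-variance) axis, keeping the topmost point per cell.
    A simplified sketch of the idea, not the paper's implementation.
    """
    centered = points - points.mean(axis=0)
    # Principal axes via SVD of the centered cloud (rows of vt).
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    plane = centered @ vt[:2].T    # coordinates within the ground plane
    height = centered @ vt[2]      # coordinate along the projection axis

    # Map plane coordinates to pixel indices.
    lo, hi = plane.min(axis=0), plane.max(axis=0)
    ij = ((plane - lo) / (hi - lo + 1e-9) * (resolution - 1)).astype(int)

    image = np.zeros((resolution, resolution, 3))
    best = np.full((resolution, resolution), -np.inf)
    for (i, j), h, c in zip(ij, height, colors):
        if h > best[i, j]:         # keep only the topmost point per cell
            best[i, j] = h
            image[i, j] = c
    return image
```

Cells that no point projects into stay empty, which is exactly the texture incompleteness that the paper's geometry-guided inpainting stage is meant to fill.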

Abstract

Drone-view geo-localization (DVGL) aims to determine the location of drones in GPS-denied environments by retrieving the corresponding geotagged satellite tile from a reference gallery given UAV observations of a location. In many existing formulations, these observations are represented by a single oblique UAV image. In contrast, our satellite-free setting is designed for multi-view UAV sequences, which are used to construct a geometry-normalized UAV-side location representation before cross-view retrieval. Existing approaches rely on satellite imagery during training, either through paired supervision or unsupervised alignment, which limits practical deployment when satellite data are unavailable or restricted. In this paper, we propose a satellite-free training (SFT) framework that converts drone imagery into cross-view compatible representations through three main stages: drone-side 3D scene reconstruction, geometry-based pseudo-orthophoto generation, and satellite-free feature aggregation for retrieval. Specifically, we first reconstruct dense 3D scenes from multi-view drone images using 3D Gaussian splatting and project the reconstructed geometry into pseudo-orthophotos via PCA-guided orthographic projection. This projection operates directly on the reconstructed scene geometry and requires no camera parameters at rendering time. Next, we refine these orthophotos with lightweight geometry-guided inpainting to obtain texture-complete drone-side views. Finally, we extract DINOv3 patch features from the generated orthophotos, learn a Fisher vector aggregation model solely from drone data, and reuse it at test time to encode satellite tiles for cross-view retrieval. Experimental results on University-1652 and SUES-200 show that our SFT framework substantially outperforms satellite-free generalization baselines and narrows the gap to methods trained with satellite imagery.
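The satellite-free aggregation stage described above can be sketched as follows: fit a Gaussian mixture model on drone-side patch descriptors only, then encode any image's patches (drone orthophoto at training time, satellite tile at test time) as a Fisher vector. This is a generic, minimal Fisher vector implementation under standard assumptions (diagonal-covariance GMM, gradients with respect to means and standard deviations, power and L2 normalization); function names are illustrative, and the paper's actual aggregation model may differ in detail.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_aggregator(patch_feats, n_components=4, seed=0):
    """Fit a diagonal-covariance GMM on drone-side patch descriptors."""
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="diag", random_state=seed)
    gmm.fit(patch_feats)
    return gmm

def fisher_vector(gmm, x):
    """Encode a patch set x of shape (N, D) as a Fisher vector (2*K*D,).

    Gradients of the per-patch log-likelihood w.r.t. the GMM means and
    standard deviations, followed by power- and L2-normalization.
    """
    n = x.shape[0]
    q = gmm.predict_proba(x)               # (N, K) soft assignments
    mu, var, w = gmm.means_, gmm.covariances_, gmm.weights_
    sig = np.sqrt(var)
    d_mu, d_sig = [], []
    for k in range(gmm.n_components):
        z = (x - mu[k]) / sig[k]           # standardized residuals
        d_mu.append((q[:, k:k + 1] * z).sum(0) / (n * np.sqrt(w[k])))
        d_sig.append((q[:, k:k + 1] * (z ** 2 - 1)).sum(0)
                     / (n * np.sqrt(2 * w[k])))
    fv = np.concatenate(d_mu + d_sig)
    fv = np.sign(fv) * np.sqrt(np.abs(fv))     # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)   # L2 normalization
```

Because both the query orthophoto and the gallery satellite tiles are encoded by the same drone-trained aggregator, retrieval reduces to ranking gallery tiles by cosine similarity between the L2-normalized Fisher vectors.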