Geo$^2$: Geometry-Guided Cross-view Geo-Localization and Image Synthesis

arXiv cs.CV, March 30, 2026


Key Points

  • Geo$^2$ is proposed as a unified framework for cross-view geo-spatial learning that jointly addresses Cross-View Geo-Localization (CVGL) and Cross-View Image Synthesis (CVIS).
  • The method leverages 3D geometric priors from Geometric Foundation Models (e.g., VGGT), but introduces GeoMap to handle the large ground–aerial viewpoint gap by mapping both views into a shared 3D-aware latent space.
  • GeoFlow is presented as a flow-matching generative model conditioned on geometry-aware latent embeddings to enable bidirectional image synthesis between ground and aerial views.
  • A consistency loss is added to enforce latent alignment across the two synthesis directions, improving bidirectional coherence.
  • Experiments on CVUSA, CVACT, and VIGOR reportedly achieve state-of-the-art results for both localization and synthesis, suggesting 3D priors can significantly improve cross-view geo tasks.
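The summary does not give GeoFlow's training objective, but standard conditional flow matching trains a velocity network to regress the displacement between a source sample and a target sample along a linear interpolation path. The sketch below illustrates that objective in numpy; the linear path, the L2 regression loss, and the toy `velocity_net` stand-in are assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def velocity_net(x_t, t, cond):
    # Hypothetical stand-in for GeoFlow's conditional velocity network;
    # the real model is a neural net conditioned on the geometry-aware
    # latent embedding `cond`.
    return cond - x_t  # toy closed-form predictor

def flow_matching_loss(x_src, x_tgt, cond):
    """One step of (rectified) conditional flow matching: interpolate
    between source and target, then regress the constant ground-truth
    velocity x_tgt - x_src at a random time t."""
    t = rng.uniform(size=(x_src.shape[0], 1))   # random time in [0, 1]
    x_t = (1.0 - t) * x_src + t * x_tgt         # point on the linear path
    v_target = x_tgt - x_src                    # ground-truth velocity
    v_pred = velocity_net(x_t, t, cond)
    return np.mean((v_pred - v_target) ** 2)

# Toy latents standing in for ground/aerial view embeddings.
x_src = rng.normal(size=(4, 8))   # e.g. noise or source-view latent
x_tgt = rng.normal(size=(4, 8))   # target-view latent
cond = rng.normal(size=(4, 8))    # geometry-aware conditioning
loss = flow_matching_loss(x_src, x_tgt, cond)
print(loss)
```

At inference, the learned velocity field is integrated from the source toward the target view (e.g. with an Euler ODE solver), which is what makes the synthesis bidirectional when conditioned on the shared latent space.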

Abstract

Cross-view geo-spatial learning consists of two important tasks: Cross-View Geo-Localization (CVGL) and Cross-View Image Synthesis (CVIS), both of which rely on establishing geometric correspondences between ground and aerial views. Recent Geometric Foundation Models (GFMs) have demonstrated strong capabilities in extracting generalizable 3D geometric features from images, but their potential in cross-view geo-spatial tasks remains underexplored. In this work, we present Geo$^2$, a unified framework that leverages Geometric priors from GFMs (e.g., VGGT) to jointly perform two geo-spatial tasks: CVGL and bidirectional CVIS. Despite the 3D reconstruction ability of GFMs, directly applying them to CVGL and CVIS remains challenging due to the large viewpoint gap between ground and aerial imagery. We propose GeoMap, which embeds ground and aerial features into a shared 3D-aware latent space, effectively reducing cross-view discrepancies for localization. This shared latent space naturally bridges cross-view image synthesis in both directions. To exploit this, we propose GeoFlow, a flow-matching model conditioned on geometry-aware latent embeddings. We further introduce a consistency loss to enforce latent alignment between the two synthesis directions, ensuring bidirectional coherence. Extensive experiments on standard benchmarks, including CVUSA, CVACT, and VIGOR, demonstrate that Geo$^2$ achieves state-of-the-art performance in both localization and synthesis, highlighting the effectiveness of 3D geometric priors for cross-view geo-spatial learning.
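The abstract describes the consistency loss only qualitatively. A common instantiation of such a term (an assumption here, not the paper's stated formula) is a symmetric distance between the latent produced for the ground-to-aerial direction and the one produced for the aerial-to-ground direction:

```python
import numpy as np

def consistency_loss(z_g2a, z_a2g):
    """Hedged sketch: penalize disagreement between the latent driving
    ground-to-aerial synthesis and the one driving aerial-to-ground
    synthesis. An L2 distance is assumed; the paper may use another
    metric (e.g. cosine distance)."""
    return np.mean((z_g2a - z_a2g) ** 2)

z = np.ones((2, 4))
print(consistency_loss(z, z))        # identical latents -> 0.0
print(consistency_loss(z, z + 0.5))  # mismatched latents -> positive
```

Pulling both direction-specific latents toward agreement is what couples the two halves of the bidirectional synthesis, so improvements in one direction propagate to the other.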