VidTAG: Temporally Aligned Video to GPS Geolocalization with Denoising Sequence Prediction at a Global Scale

arXiv cs.CV · April 15, 2026


Key Points

  • VidTAG is a proposed dual-encoder framework for fine-grained video geolocalization that retrieves frame-to-GPS correspondences using self-supervised and language-aligned features.
  • The work addresses temporal inconsistency in video predictions by introducing TempGeo for aligning frame embeddings and GeoRefiner (an encoder–decoder) for refining GPS features based on those aligned embeddings.
  • Experiments on Mapillary (MSLS) and GAMa show temporally consistent trajectory generation and results that outperform GeoCLIP, including a reported 20% improvement at the 1 km threshold.
  • VidTAG also achieves a reported 25% improvement over the state of the art on the global coarse-grained CityGuessr68k benchmark, suggesting strong scalability advantages versus image-gallery-based retrieval.
  • The authors position the method as enabling practical fine-grained video-to-GPS trajectory estimation with applications such as forensics, social media analysis, and exploration.
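The core retrieval idea in the points above can be sketched with a toy example: embed a gallery of GPS coordinates and a set of video frames into a shared space, then assign each frame the gallery coordinate with the highest cosine similarity. Everything here is a hypothetical stand-in for illustration only: the random-Fourier-feature GPS encoder and the simulated frame embeddings replace the learned encoders the paper trains, and the city coordinates are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)
DIM = 32

# A small "gallery" of GPS coordinates (lat, lon). Compiling such a gallery
# is cheap compared with a global image gallery -- the paper's core argument.
gps_gallery = np.array([
    [40.7128, -74.0060],   # New York
    [51.5074,  -0.1278],   # London
    [35.6762, 139.6503],   # Tokyo
    [-33.8688, 151.2093],  # Sydney
])

# Hypothetical GPS encoder: random Fourier features of the coordinates,
# standing in for a learned location encoder.
W = rng.standard_normal((2, DIM))

def encode_gps(coords: np.ndarray) -> np.ndarray:
    feats = np.concatenate([np.sin(coords @ W), np.cos(coords @ W)], axis=1)
    return feats / np.linalg.norm(feats, axis=1, keepdims=True)

gallery_emb = encode_gps(gps_gallery)

# Simulated frame embeddings: each frame's embedding is a slightly noisy
# copy of its true location's embedding (a trained frame encoder would
# produce something analogous after alignment).
true_ids = np.array([0, 0, 1, 2, 3])
frame_emb = gallery_emb[true_ids] + 0.02 * rng.standard_normal((5, 2 * DIM))
frame_emb /= np.linalg.norm(frame_emb, axis=1, keepdims=True)

# Frame-to-GPS retrieval: cosine similarity against the gallery, argmax.
scores = frame_emb @ gallery_emb.T
pred_ids = scores.argmax(axis=1)
pred_coords = gps_gallery[pred_ids]  # per-frame GPS predictions
```

With noise this small, retrieval recovers each frame's true gallery entry; the design point is that the searchable index is a list of coordinates, not a planet-scale image collection.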

Abstract

The task of video geolocalization aims to determine the precise GPS coordinates of a video's origin and map its trajectory, with applications in forensics, social media, and exploration. Existing classification-based approaches operate at a coarse city-level granularity and fail to capture fine-grained details, while image retrieval methods are impractical on a global scale due to their need for extensive image galleries, which are infeasible to compile. By comparison, constructing a gallery of GPS coordinates is straightforward and inexpensive. We propose VidTAG, a dual-encoder framework that performs frame-to-GPS retrieval using both self-supervised and language-aligned features. To address temporal inconsistencies in video predictions, we introduce the TempGeo module, which aligns frame embeddings, and the GeoRefiner module, an encoder-decoder architecture that refines GPS features using the aligned frame embeddings. Evaluations on the Mapillary (MSLS) and GAMa datasets demonstrate our model's ability to generate temporally consistent trajectories and outperform baselines, achieving a 20% improvement over GeoCLIP at the 1 km threshold. We also beat the current state of the art by 25% on global coarse-grained video geolocalization (CityGuessr68k). Our approach enables fine-grained video geolocalization and lays a strong foundation for future research. More details are on the project webpage: https://parthpk.github.io/vidtag_webpage/
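The temporal-inconsistency problem the abstract describes can be illustrated without the paper's learned modules. TempGeo and GeoRefiner perform learned alignment and refinement; as a deliberately simple stand-in, a centered moving average over jittery per-frame GPS predictions shows the kind of smoothed, temporally consistent trajectory being targeted. The function name and the trajectory values below are hypothetical.

```python
import numpy as np

def smooth_trajectory(coords: np.ndarray, window: int = 3) -> np.ndarray:
    """Centered moving average over a (T, 2) array of (lat, lon) predictions.

    Edge-pads the sequence so the output has the same length as the input.
    This is NOT the paper's TempGeo/GeoRefiner, just a toy smoother.
    """
    pad = window // 2
    padded = np.pad(coords, ((pad, pad), (0, 0)), mode="edge")
    kernel = np.ones(window) / window
    return np.stack(
        [np.convolve(padded[:, d], kernel, mode="valid") for d in range(2)],
        axis=1,
    )

# Jittery per-frame predictions along a straight road (made-up values).
raw = np.array([[40.0, -74.0], [40.3, -74.0], [40.1, -74.0],
                [40.4, -74.0], [40.2, -74.0]])
smoothed = smooth_trajectory(raw)
```

A fixed-window average only enforces local smoothness; the appeal of a learned refinement module is that it can exploit visual evidence across frames rather than blindly averaging coordinates.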