VidTAG: Temporally Aligned Video to GPS Geolocalization with Denoising Sequence Prediction at a Global Scale

arXiv cs.CV · April 15, 2026


Key Points

  • VidTAG is a proposed dual-encoder framework for fine-grained video geolocalization that retrieves frame-to-GPS correspondences using self-supervised and language-aligned features.
  • The work addresses temporal inconsistency in video predictions by introducing TempGeo for aligning frame embeddings and GeoRefiner (an encoder–decoder) for refining GPS features based on those aligned embeddings.
  • Experiments on Mapillary (MSLS) and GAMa show temporally consistent trajectory generation and results that outperform GeoCLIP, including a reported 20% improvement at the 1 km threshold.
  • VidTAG also achieves a reported 25% improvement over the state of the art on the global coarse-grained CityGuessr68k benchmark, suggesting strong scalability advantages versus image-gallery-based retrieval.
  • The authors position the method as enabling practical fine-grained video-to-GPS trajectory estimation with applications such as forensics, social media analysis, and exploration.
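The core retrieval idea in the points above can be sketched with a toy example: embed a gallery of GPS coordinates and a set of video frames into a shared space, then assign each frame the gallery coordinate with the highest cosine similarity. Everything here is a hypothetical stand-in for illustration only: the random-Fourier-feature GPS encoder and the simulated frame embeddings replace the learned encoders the paper trains, and the city coordinates are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)
DIM = 32

# A small "gallery" of GPS coordinates (lat, lon). Compiling such a gallery
# is cheap compared with a global image gallery -- the paper's core argument.
gps_gallery = np.array([
    [40.7128, -74.0060],   # New York
    [51.5074,  -0.1278],   # London
    [35.6762, 139.6503],   # Tokyo
    [-33.8688, 151.2093],  # Sydney
])

# Hypothetical GPS encoder: random Fourier features of the coordinates,
# standing in for a learned location encoder.
W = rng.standard_normal((2, DIM))

def encode_gps(coords: np.ndarray) -> np.ndarray:
    feats = np.concatenate([np.sin(coords @ W), np.cos(coords @ W)], axis=1)
    return feats / np.linalg.norm(feats, axis=1, keepdims=True)

gallery_emb = encode_gps(gps_gallery)

# Simulated frame embeddings: each frame's embedding is a slightly noisy
# copy of its true location's embedding (a trained frame encoder would
# produce something analogous after alignment).
true_ids = np.array([0, 0, 1, 2, 3])
frame_emb = gallery_emb[true_ids] + 0.02 * rng.standard_normal((5, 2 * DIM))
frame_emb /= np.linalg.norm(frame_emb, axis=1, keepdims=True)

# Frame-to-GPS retrieval: cosine similarity against the gallery, argmax.
scores = frame_emb @ gallery_emb.T
pred_ids = scores.argmax(axis=1)
pred_coords = gps_gallery[pred_ids]  # per-frame GPS predictions
```

With noise this small, retrieval recovers each frame's true gallery entry; the design point is that the searchable index is a list of coordinates, not a planet-scale image collection.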

Abstract

The task of video geolocalization aims to determine the precise GPS coordinates of a video's origin and map its trajectory, with applications in forensics, social media, and exploration. Existing classification-based approaches operate at a coarse city-level granularity and fail to capture fine-grained details, while image retrieval methods are impractical on a global scale due to their need for extensive image galleries, which are infeasible to compile. By comparison, constructing a gallery of GPS coordinates is straightforward and inexpensive. We propose VidTAG, a dual-encoder framework that performs frame-to-GPS retrieval using both self-supervised and language-aligned features. To address temporal inconsistencies in video predictions, we introduce the TempGeo module, which aligns frame embeddings, and the GeoRefiner module, an encoder-decoder architecture that refines GPS features using the aligned frame embeddings. Evaluations on the Mapillary (MSLS) and GAMa datasets demonstrate our model's ability to generate temporally consistent trajectories and outperform baselines, achieving a 20% improvement over GeoCLIP at the 1 km threshold. We also beat the current state of the art by 25% on global coarse-grained video geolocalization (CityGuessr68k). Our approach enables fine-grained video geolocalization and lays a strong foundation for future research. More details are on the project webpage: https://parthpk.github.io/vidtag_webpage/
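The temporal-inconsistency problem the abstract describes can be illustrated without the paper's learned modules. TempGeo and GeoRefiner perform learned alignment and refinement; as a deliberately simple stand-in, a centered moving average over jittery per-frame GPS predictions shows the kind of smoothed, temporally consistent trajectory being targeted. The function name and the trajectory values below are hypothetical.

```python
import numpy as np

def smooth_trajectory(coords: np.ndarray, window: int = 3) -> np.ndarray:
    """Centered moving average over a (T, 2) array of (lat, lon) predictions.

    Edge-pads the sequence so the output has the same length as the input.
    This is NOT the paper's TempGeo/GeoRefiner, just a toy smoother.
    """
    pad = window // 2
    padded = np.pad(coords, ((pad, pad), (0, 0)), mode="edge")
    kernel = np.ones(window) / window
    return np.stack(
        [np.convolve(padded[:, d], kernel, mode="valid") for d in range(2)],
        axis=1,
    )

# Jittery per-frame predictions along a straight road (made-up values).
raw = np.array([[40.0, -74.0], [40.3, -74.0], [40.1, -74.0],
                [40.4, -74.0], [40.2, -74.0]])
smoothed = smooth_trajectory(raw)
```

A fixed-window average only enforces local smoothness; the appeal of a learned refinement module is that it can exploit visual evidence across frames rather than blindly averaging coordinates.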