TIGeR: A Unified Framework for Time, Images and Geo-location Retrieval

arXiv cs.CV / 3/27/2026


Key Points

  • Defines "Geo-Time Aware Image Retrieval," which reasons jointly about not only visual appearance but also capture location (geolocation) and capture time, and builds a corresponding benchmark (4.5M triplets for training, 86k for evaluation).
  • TIGeR uses a multi-modal transformer to map images, geolocation, and time into a unified geo-temporal embedding space, and supports both single-modality and multi-modality queries.
  • With the same representation, TIGeR performs (i) geo-localization, (ii) time-of-capture prediction, and (iii) geo-time-conditioned retrieval (e.g., retrieving an image of the same location at a specified target time).
  • By better preserving location identity under large appearance changes, TIGeR enables retrieval based on where and when a scene was captured rather than on visual similarity alone, improving over prior methods by up to 16% (time-of-year), 8% (time-of-day), and 14% (retrieval recall).

Abstract

Many real-world applications in digital forensics, urban monitoring, and environmental analysis require jointly reasoning about visual appearance, geolocation, and time. Beyond standard geo-localization and time-of-capture prediction, these applications increasingly demand more complex capabilities, such as retrieving an image captured at the same location as a query image but at a specified target time. We formalize this problem as Geo-Time Aware Image Retrieval and curate a diverse benchmark of 4.5M paired image-location-time triplets for training and 86k high-quality triplets for evaluation. We then propose TIGeR, a multi-modal-transformer-based model that maps image, geolocation, and time into a unified geo-temporal embedding space. TIGeR supports flexible input configurations (single-modality and multi-modality queries) and uses the same representation to perform (i) geo-localization, (ii) time-of-capture prediction, and (iii) geo-time-aware retrieval. By better preserving underlying location identity under large appearance changes, TIGeR enables retrieval based on where and when a scene was captured, rather than purely on visual similarity. Extensive experiments show that TIGeR consistently outperforms strong baselines and state-of-the-art methods by up to 16% on time-of-year prediction, 8% on time-of-day prediction, and 14% in geo-time-aware retrieval recall, highlighting the benefits of unified geo-temporal modeling.
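To make the retrieval setting concrete, here is a minimal, hypothetical sketch of geo-time-aware retrieval in a shared embedding space. The paper's actual encoders are learned multi-modal transformer branches; this toy stands in for them with orthonormal basis vectors for two locations and two capture times, and models multi-modal fusion as a normalized sum of embeddings (an assumption made purely for illustration).

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

# Hypothetical stand-ins for learned location/time embeddings: orthonormal
# basis vectors. An image embedding is modeled as the normalized sum of its
# location and time components.
loc_a, loc_b, t_summer, t_winter = np.eye(4)

gallery = np.stack([
    normalize(loc_a + t_summer),  # same place, original season
    normalize(loc_a + t_winter),  # same place, requested season
    normalize(loc_b + t_winter),  # different place, requested season
])

# Multi-modality query: an image taken at location A in summer, combined
# with a target time of winter. Fusion-by-summation is an assumption of
# this sketch, not the paper's method.
query_image = normalize(loc_a + t_summer)
query = normalize(query_image + t_winter)

scores = gallery @ query          # cosine similarity (all rows are unit vectors)
best = int(np.argmax(scores))
print(best)  # index 1: the same location at the requested time ranks first
```

The point of the arithmetic: because location and time occupy the same space, the fused query overlaps the same-location/target-time gallery item on both components, so it outranks both the visually-nearest item (same place, same season) and the time-matched item at a different place.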