Object-Level Explanations for Image Geolocation Models: A GeoGuessr Use Case

arXiv cs.CV / 5/5/2026


Key Points

  • The paper explores whether image geolocation models base their predictions on object-level visual cues (e.g., road markings, vegetation, and building details) similar to how humans play GeoGuessr.
  • It introduces an object-centric analysis pipeline that turns standard attribution maps into segmented, object-like elements by extracting salient regions from the attributions.
  • The method evaluates which inferred elements matter by using deletion and insertion tests, comparing attribution-guided crops against randomly chosen regions with comparable area coverage.
  • Experiments on a three-country benchmark find that attribution-guided crops preserve more predictive information than random crops, indicating that attribution maps can be decomposed into interpretable, perceptible elements.
  • The authors propose this as a step toward more object-level explanations for image geolocation models, beyond diffuse heatmap-style attribution.
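The insertion-style comparison described above can be illustrated with a toy sketch. The snippet below is a minimal illustration, not the paper's implementation: it assumes an idealized attribution map that highlights the informative region, builds an attribution-guided mask and a random mask of comparable pixel coverage, and measures how much of the model's evidence each retained region preserves (the `insertion_score`, `top_attribution_mask`, and toy `model` names are hypothetical).

```python
import numpy as np

def top_attribution_mask(attr, coverage):
    """Binary mask keeping the highest-attribution pixels at the given coverage."""
    k = max(1, int(attr.size * coverage))
    thresh = np.sort(attr.ravel())[-k]
    return attr >= thresh

def random_mask(shape, coverage, rng):
    """Binary mask over randomly chosen pixels with comparable coverage."""
    k = max(1, int(np.prod(shape) * coverage))
    mask = np.zeros(int(np.prod(shape)), dtype=bool)
    mask[rng.choice(mask.size, size=k, replace=False)] = True
    return mask.reshape(shape)

def insertion_score(image, mask, model):
    """Insertion test: keep only the masked region (rest zeroed), re-score the model."""
    return model(image * mask)

rng = np.random.default_rng(0)
image = np.zeros((32, 32))
image[4:8, 4:8] = 1.0                      # toy "object" carrying the evidence
attr = image + 0.01 * rng.random(image.shape)  # idealized attribution + noise

# Toy surrogate model: fraction of the evidence pixels still visible.
model = lambda x: float(x.sum() / image.sum())

m_attr = top_attribution_mask(attr, coverage=0.05)
m_rand = random_mask(image.shape, coverage=0.05, rng=rng)

s_attr = insertion_score(image, m_attr, model)
s_rand = insertion_score(image, m_rand, model)
print(s_attr, s_rand)
```

Under this idealized setup the attribution-guided mask recovers the full evidence region while the random mask of equal area mostly misses it, which is the qualitative pattern the paper's deletion and insertion tests probe; the deletion variant is the mirror image (zero out the masked region and check how much the prediction drops).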

Abstract

When humans play geolocation games such as GeoGuessr, they rely on concrete visual cues, such as road markings, vegetation, or architectural details, to infer where an image was captured. Whether image geolocation models rely on similar object-level evidence remains unclear: attribution methods like Grad-CAM typically highlight diffuse regions rather than coherent visual entities, making it difficult to link model predictions to specific objects or perceptible patterns. In this work, we propose an object-centric analysis pipeline to investigate the visual evidence used by geolocation models. Starting from attribution maps, we extract salient regions and segment them into object-like elements. We evaluate their predictive relevance through deletion and insertion tests, comparing attribution-guided crops to randomly selected regions with similar coverage. Experiments on a three-country benchmark show that attribution-guided crops consistently retain more information for the model's prediction than random crops. These results suggest that attribution maps can be decomposed into interpretable, perceptible elements, providing a step toward object-level analysis of geolocation models.