EVGeoQA: Benchmarking LLMs on Dynamic, Multi-Objective Geo-Spatial Exploration

arXiv cs.AI / 4/10/2026

Key Points

  • The paper introduces EVGeoQA, a new benchmark for evaluating LLMs in dynamic, real-time geo-spatial exploration rather than static retrieval, using EV charging scenarios tied to a user’s current coordinates.
  • EVGeoQA uses a dual-objective setup that balances charging necessity against a preferred co-located activity, to better reflect real-world planning constraints.
  • To assess performance in these complex settings, the authors propose GeoRover, a tool-augmented agent evaluation framework designed to measure multi-objective exploration capabilities.
  • Experiments show that LLMs can use tools for sub-tasks but still struggle with long-range spatial exploration, indicating a key limitation in their navigation-like reasoning.
  • The study also reports an emergent behavior: LLMs summarize prior exploration trajectories to make later exploration more efficient. The dataset and prompts are publicly released.

Abstract

While Large Language Models (LLMs) demonstrate remarkable reasoning capabilities, their potential for purpose-driven exploration in dynamic geo-spatial environments remains under-investigated. Existing Geo-Spatial Question Answering (GSQA) benchmarks predominantly focus on static retrieval, failing to capture the complexity of real-world planning that involves dynamic user locations and compound constraints. To bridge this gap, we introduce EVGeoQA, a novel benchmark built upon Electric Vehicle (EV) charging scenarios that features a distinct location-anchored and dual-objective design. Specifically, each query in EVGeoQA is explicitly bound to a user's real-time coordinate and integrates the dual objectives of a charging necessity and a co-located activity preference. To systematically assess models in such complex settings, we further propose GeoRover, a general evaluation framework based on a tool-augmented agent architecture to evaluate the LLMs' capacity for dynamic, multi-objective exploration. Our experiments reveal that while LLMs successfully utilize tools to address sub-tasks, they struggle with long-range spatial exploration. Notably, we observe an emergent capability: LLMs can summarize historical exploration trajectories to enhance exploration efficiency. These findings establish EVGeoQA as a challenging testbed for future geo-spatial intelligence. The dataset and prompts are available at https://github.com/Hapluckyy/EVGeoQA/.
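
To make the location-anchored, dual-objective idea concrete, the following is a minimal sketch of how such a query could be checked against candidate stations. It is not the paper's GeoRover framework or the EVGeoQA data format: the `Station` type, the `dual_objective_candidates` function, the sample coordinates, and the distance threshold are all hypothetical, chosen only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Station:
    """Hypothetical charging station record (not the EVGeoQA schema)."""
    name: str
    lat: float
    lon: float
    nearby_activities: frozenset


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def dual_objective_candidates(user_lat, user_lon, stations, activity, max_km):
    """Return stations within max_km of the user that also offer the
    preferred co-located activity, nearest first."""
    hits = []
    for station in stations:
        dist = haversine_km(user_lat, user_lon, station.lat, station.lon)
        # Objective 1: reachable for charging. Objective 2: activity nearby.
        if dist <= max_km and activity in station.nearby_activities:
            hits.append((dist, station))
    hits.sort(key=lambda pair: pair[0])
    return [station for _, station in hits]


# Made-up sample data, anchored to a made-up user position at (0.0, 0.0).
SAMPLE_STATIONS = [
    Station("A", 0.0, 0.01, frozenset({"coffee"})),
    Station("B", 0.0, 0.02, frozenset({"gym"})),
    Station("C", 0.0, 0.05, frozenset({"coffee"})),
]
```

Calling `dual_objective_candidates(0.0, 0.0, SAMPLE_STATIONS, "coffee", 3.0)` filters on both objectives at once. An agent in the paper's setting would instead have to discover such candidates through repeated tool calls, which is where the reported difficulty with long-range exploration arises.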