SeekerGym: A Benchmark for Reliable Information Seeking

arXiv cs.LG · April 21, 2026


Key Points

  • The paper introduces SeekerGym, a new benchmark focused on evaluating the completeness of information retrieved by AI agents, not just whether retrieved content is relevant or correct.
  • It assesses both retrieval completeness (how much of a source document’s relevant sections are found) and uncertainty calibration (how well agents quantify what might be missing when retrieval is incomplete).
  • SeekerGym tasks are defined as documents (e.g., Wikipedia pages or machine learning survey papers), and agents must issue queries to retrieve relevant passages from those documents.
  • Benchmark results show that even the best methods still retrieve only 42.5% of passages on Wikipedia and 29.2% on ML survey papers, indicating significant room to improve reliable information seeking.
  • The authors highlight that incomplete retrieval can introduce user-facing bias and mislead users even when the returned information is individually correct and relevant.
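
The two metrics the benchmark tracks can be made concrete with a small sketch. This is illustrative only (not the paper's code, and the function names are hypothetical): retrieval completeness is the fraction of a document's relevant passages the agent actually found, and calibration compares the agent's own completeness estimate against that ground truth.

```python
# Illustrative sketch, not SeekerGym's implementation. All names are
# hypothetical; passages are represented as simple strings.

def retrieval_completeness(relevant_passages, retrieved_passages):
    """Fraction of the document's relevant passages the agent retrieved."""
    relevant = set(relevant_passages)
    if not relevant:
        return 1.0  # nothing to find, trivially complete
    return len(relevant & set(retrieved_passages)) / len(relevant)

def calibration_error(estimated_completeness, true_completeness):
    """Gap between the agent's self-reported completeness and the truth."""
    return abs(estimated_completeness - true_completeness)

# Example: a document with 4 relevant sections, of which the agent finds 2.
doc_sections = ["history", "design", "criticism", "reception"]
found = ["history", "design"]

score = retrieval_completeness(doc_sections, found)  # 0.5
err = calibration_error(0.9, score)  # an overconfident agent is off by 0.4
```

An agent that retrieves half the passages but claims 90% completeness would score well on per-item relevance yet badly on both SeekerGym axes, which is exactly the failure mode the authors highlight.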

Abstract

Despite their substantial successes, AI agents continue to face fundamental challenges in terms of trustworthiness. Consider deep research agents, tasked with searching for information relevant to a given topic: while AI agents can perform effective information retrieval, there is little guarantee regarding the completeness of this information. Gaps in retrieved information can introduce biases that mislead users even if the information they are given is correct and relevant. We introduce SeekerGym, a benchmark designed to evaluate the completeness of information retrieved by AI agents. In addition, SeekerGym also measures how well agents quantify their uncertainty in the completeness of their information; if an agent fails to retrieve all relevant information, it is useful for it to at least quantify how much might be missing. At a high level, each task in SeekerGym is a document (e.g., a Wikipedia article), and the AI agent must issue queries to retrieve passages from that document. Intuitively, the document comprehensively covers a topic, so the ability to retrieve its sections directly measures completeness of information retrieval. In addition to Wikipedia, we also consider machine learning survey papers, where the goal is to retrieve relevant sections of a survey paper. We benchmark several models and algorithms; the best approaches retrieve 42.5% of passages on Wikipedia and 29.2% on ML surveys, leaving substantial room for improvement.
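
The task loop the abstract describes can be sketched roughly as follows. This is a toy mock-up under stated assumptions, not SeekerGym's actual interface: the retriever here is a naive keyword matcher, and `run_episode`, the query budget, and all other names are hypothetical.

```python
# Toy stand-in for the benchmark's episode structure: an agent issues
# queries against a single document and accumulates retrieved passages.
# SeekerGym's real retriever and agent interface may differ.

def keyword_retriever(document, query, k=2):
    """Return up to k passages containing the query term (toy matcher)."""
    hits = [p for p in document if query.lower() in p.lower()]
    return hits[:k]

def run_episode(document, queries, budget=3):
    """Run one retrieval episode: issue queries until the budget is spent."""
    retrieved = set()
    for query in queries[:budget]:
        for passage in keyword_retriever(document, query):
            retrieved.add(passage)
    return retrieved

# A three-section "document"; the agent's queries only cover two sections,
# so its retrieval is incomplete even though every hit is relevant.
doc = [
    "History of the topic ...",
    "Design principles ...",
    "Criticism and open problems ...",
]
found = run_episode(doc, ["history", "design"])
```

Scoring an episode then reduces to comparing `found` against the document's full set of relevant sections, which is why retrieving a document's own sections serves as a direct completeness measure.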