Evaluating LLMs on Large-Scale Graph Property Estimation via Random Walks

arXiv cs.LG / 5/5/2026


Key Points

  • The paper highlights the need to benchmark LLM reasoning limits beyond small, fully visible graphs, since real-world graph data is often much larger and only partially accessible.
  • It introduces a new large-graph benchmark dataset called EstGraph, along with four tasks aimed at estimating large-scale graph properties.
  • The researchers evaluate multiple LLMs on these tasks across a variety of graph datasets, focusing on how well models can infer global properties from limited context.
  • To address context-length constraints, the paper proposes task-specific prompt construction methods that use random-walk sampling from very large graphs (up to millions of nodes) to provide sufficient information to the LLMs.

Abstract

With the rapidly improving reasoning abilities of Large Language Models (LLMs), demand is rising to apply them across a wide variety of domains, which in turn creates the need to carefully evaluate the limits of these models' capabilities with rigorous tests and benchmarks. Graph structures are ubiquitous in real-world data and are often used to represent and analyze relational patterns. Many benchmarks have already been proposed in the graph literature to test the ability of LLMs to follow and execute graph algorithms. However, due to the limited context length of LLMs, these benchmarks consist of very small graphs. In real-world data, graphs can be significantly larger and, in many cases, not fully accessible. In this paper, we examine a class of problems that arises with very large graphs of limited accessibility. We propose a large-graph benchmark dataset, EstGraph, and introduce four distinct tasks designed to estimate properties of large graphs. We evaluate the reasoning abilities of LLMs on these tasks using a wide variety of graph datasets. In addition, we provide task-specific prompt constructions based on random-walk sampling of large graphs (up to millions of nodes) that effectively convey sufficient information to LLMs within the limits of the context length.
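The core idea, serializing random-walk samples of a large graph into a prompt that fits a context budget, can be sketched as follows. This is a minimal illustration, not the paper's actual construction: the walk parameters, the textual edge format, and the estimation question are all assumptions made for the example, and a character count stands in for a real token budget.

```python
import random

def random_walk(adj, start, length, rng):
    """Sample a random walk of up to `length` steps over adjacency dict `adj`."""
    walk = [start]
    node = start
    for _ in range(length):
        nbrs = adj.get(node)
        if not nbrs:  # dead end: stop the walk early
            break
        node = rng.choice(nbrs)
        walk.append(node)
    return walk

def build_prompt(adj, num_walks=3, walk_length=8, char_budget=500, seed=0):
    """Serialize several walks into a prompt, stopping before the budget is exceeded."""
    rng = random.Random(seed)
    nodes = list(adj)
    lines = ["Edges observed along random walks of a large graph:"]
    used = len(lines[0])
    for _ in range(num_walks):
        walk = random_walk(adj, rng.choice(nodes), walk_length, rng)
        line = " -> ".join(map(str, walk))
        if used + len(line) > char_budget:  # stay within the context budget
            break
        lines.append(line)
        used += len(line)
    # Hypothetical estimation question; the paper defines four concrete tasks.
    lines.append("Estimate the average degree of the full graph.")
    return "\n".join(lines)

# Toy graph: a 6-node cycle, standing in for a graph too large to serialize fully.
adj = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
prompt = build_prompt(adj)
print(prompt)
```

Because walks only touch a sample of nodes, the prompt size is controlled by `num_walks`, `walk_length`, and the budget rather than by the size of the graph, which is what makes the approach viable for graphs with millions of nodes.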