AI Navigate

Evaluating Progress in Graph Foundation Models: A Comprehensive Benchmark and New Insights

arXiv cs.AI / 3/12/2026


Key Points

  • The paper argues that graph foundation model benchmarking should address two dimensions—topic domains and format domains—whereas prior benchmarks mostly varied only topic domains.
  • It introduces a new benchmark that jointly evaluates semantic generalization and robustness to representational shifts across the full GFM pipeline, including multi-domain self-supervised pre-training and few-shot downstream adaptation.
  • The protocol defines four evaluation settings to isolate knowledge transfer across topics and formats: (i) diverse topics and formats with unseen downstream datasets, (ii) diverse topics and formats with seen datasets, (iii) a single topic with adaptation to other topics, and (iv) a base format with adaptation to other formats.
  • The study conducts extensive experiments evaluating eight state-of-the-art GFMs on 33 datasets spanning seven topic domains and six format domains, surfacing new empirical observations and practical insights for future work.
  • Code and data for the benchmark are publicly available at the linked GitHub repository.
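The four-setting protocol above is essentially a grid over what is varied at pre-training versus adaptation time. The sketch below expresses that grid as plain configuration; all names and the structure are illustrative assumptions, not taken from the benchmark's actual code (see the GFMBenchmark repository for the real setup):

```python
# Hypothetical sketch of the benchmark's four evaluation settings.
# Keys and values are illustrative, not the benchmark's real config format.

SETTINGS = {
    "S1": {  # diverse topics/formats -> unseen downstream datasets
        "pretrain": {"topics": "diverse", "formats": "diverse"},
        "adapt": {"datasets": "unseen"},
    },
    "S2": {  # same pre-training as S1 -> seen downstream datasets
        "pretrain": {"topics": "diverse", "formats": "diverse"},
        "adapt": {"datasets": "seen"},
    },
    "S3": {  # single topic domain -> other topic domains
        "pretrain": {"topics": "single"},
        "adapt": {"topics": "other"},
    },
    "S4": {  # base format -> other format domains
        "pretrain": {"formats": "base"},
        "adapt": {"formats": "other"},
    },
}

def axis_probed(setting: dict) -> str:
    """Return which generalization axis a setting isolates."""
    adapt = setting["adapt"]
    if "topics" in adapt:
        return "semantic generalization (topic shift)"
    if "formats" in adapt:
        return "representational robustness (format shift)"
    return "overall transfer (seen vs. unseen datasets)"
```

Viewed this way, S1/S2 measure overall transfer under maximal pre-training diversity, while S3 and S4 each hold one axis fixed to isolate topic shift and format shift respectively.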

Abstract

Graph foundation models (GFMs) aim to acquire transferable knowledge by pre-training on diverse graphs, which can be adapted to various downstream tasks. However, domain shift in graphs is inherently two-dimensional: graphs differ not only in what they describe (topic domains) but also in how they are represented (format domains). Most existing GFM benchmarks vary only topic domains, thereby obscuring how knowledge transfers across both dimensions. We present a new benchmark that jointly evaluates topic and format gaps across the full GFM pipeline, including multi-domain self-supervised pre-training and few-shot downstream adaptation, and provides a timely evaluation of recent GFMs in the rapidly evolving landscape. Our protocol enables controlled assessment in four settings: (i) pre-training on diverse topics and formats, while adapting to unseen downstream datasets; (ii) same pre-training as in (i), while adapting to seen datasets; (iii) pre-training on a single topic domain, while adapting to other topics; (iv) pre-training on a base format, while adapting to other formats. This two-axis evaluation disentangles semantic generalization from robustness to representational shifts. We conduct extensive evaluations of eight state-of-the-art GFMs on 33 datasets spanning seven topic domains and six format domains, surfacing new empirical observations and practical insights for future research. Code and data are available at https://github.com/smufang/GFMBenchmark.