GraphSculptor: Sculpting Pre-training Coreset for Graph Self-supervised Learning

arXiv cs.LG / 5/5/2026


Key Points

  • Graph self-supervised learning is computationally expensive at scale, but the paper finds that uniformly subsampling 50% of graphs preserves over 96% of downstream performance, evidence of substantial redundancy.
  • It proposes GraphSculptor, a label-free method for building pre-training coresets by combining intrinsic structural signals with contextual semantic signals derived from graph-to-text descriptions.
  • Structural diversity is computed from intrinsic graph statistics, while semantic diversity is obtained by encoding generated graph descriptions using a pre-trained language model.
  • GraphSculptor merges both views into a unified metric space and uses cluster-aware selection to maintain joint structural-semantic diversity.
  • The authors provide a theoretical loss-gap bound and show experimentally that a 10% coreset can reach 99.6% of full-data performance while cutting pre-training time by nearly 90%.

Abstract

Graph self-supervised learning typically relies on large-scale unlabeled datasets, heavily inflating computational costs. However, empirical evidence suggests that these datasets contain substantial redundancy: our analysis reveals that uniformly subsampling 50% of graphs retains over 96% of downstream performance. To exploit this redundancy, we introduce GraphSculptor for pre-training coreset construction. Unlike methods dependent on additional training-time signals or limited solely to topological statistics, GraphSculptor provides a label-free solution that constructs coresets via two complementary perspectives: intrinsic structure and contextual semantics. Concretely, structural diversity is quantified using intrinsic graph statistics, yielding a structural feature vector for each graph, while semantic diversity is captured by utilizing a pre-trained language model to encode descriptions generated via graph-to-text. GraphSculptor integrates these signals into a unified metric space and performs cluster-aware selection to preserve joint structural-semantic diversity. We further derive a theoretical bound on the loss gap between coreset and full-data pre-training, offering theoretical motivation for our selection formulation. Extensive experiments demonstrate that GraphSculptor effectively sculpts the dataset: a 10% coreset achieves 99.6% of full-data performance while reducing pre-training time by nearly 90%, offering a scalable solution for data-efficient graph pre-training.
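To make the pipeline concrete, here is a minimal, hypothetical sketch of the selection stage described above: per-graph structural feature vectors and language-model embeddings of graph descriptions are z-normalized and concatenated into one metric space, then a cluster-aware rule picks points near each cluster centroid in proportion to cluster size. The paper does not publish this exact procedure; all function names, the plain k-means routine, and the proportional quota rule are illustrative assumptions, and the random vectors stand in for real graph statistics and text embeddings.

```python
import math
import random


def unified_features(struct_feats, sem_embeds):
    """Z-normalize each view, then concatenate into a single metric space."""
    def znorm(rows):
        dims = len(rows[0])
        means = [sum(r[d] for r in rows) / len(rows) for d in range(dims)]
        stds = [max(1e-8, math.sqrt(sum((r[d] - means[d]) ** 2 for r in rows) / len(rows)))
                for d in range(dims)]
        return [[(r[d] - means[d]) / stds[d] for d in range(dims)] for r in rows]
    return [a + b for a, b in zip(znorm(struct_feats), znorm(sem_embeds))]


def dist2(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))


def kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd's k-means (stand-in for any clustering of the unified space)."""
    rng = random.Random(seed)
    cents = rng.sample(X, k)
    assign = [0] * len(X)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: dist2(x, cents[c])) for x in X]
        for c in range(k):
            members = [X[i] for i in range(len(X)) if assign[i] == c]
            if members:
                cents[c] = [sum(col) / len(members) for col in zip(*members)]
    return cents, assign


def select_coreset(X, k, budget):
    """Cluster-aware selection: per-cluster quotas proportional to cluster size,
    filled with the points closest to each centroid (a diversity-preserving pick)."""
    cents, assign = kmeans(X, k)
    chosen = []
    for c in range(k):
        idx = [i for i in range(len(X)) if assign[i] == c]
        if not idx:
            continue
        quota = max(1, round(budget * len(idx) / len(X)))
        idx.sort(key=lambda i: dist2(X[i], cents[c]))
        chosen.extend(idx[:quota])
    return sorted(chosen)[:budget]  # trim in case rounding over-allocates
```

Usage would replace the random stand-ins with real inputs, e.g. `unified_features(graph_stats, lm_embeddings)` where `graph_stats[i]` holds degree/clustering-style statistics of graph `i` and `lm_embeddings[i]` is the language-model encoding of its generated description, followed by `select_coreset(X, k, budget=int(0.1 * len(X)))` for a 10% coreset.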