From Moments to Models: Graphon-Mixture Learning for Mixup and Contrastive Learning

arXiv stat.ML / 4/1/2026


Key Points

  • The paper presents a unified framework for modeling real-world graphs as mixtures of generative graph models using graphons and estimating their components from graph moments (motif densities).
  • It introduces a theoretical guarantee that graphs from structurally similar graphons have similar motif densities with high probability, supporting principled graphon-mixture estimation.
  • The authors show that conditioning on the inferred generative mixture components improves two downstream paradigms: graphon-mixture-aware mixup (GMAM) for augmentation and model-aware graph contrastive learning (MGCL).
  • Experiments on simulated and real datasets indicate that GMAM achieves new state-of-the-art supervised accuracy on 6 of 7 datasets, while MGCL is competitive on unsupervised benchmarks and achieves the best average rank.
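The clustering step above can be illustrated with a minimal sketch: compute a few motif densities (edge, wedge, triangle) as the graph moments, then cluster graphs in that moment space. This is an assumption-laden toy version, not the paper's actual estimator; the function names and the choice of k-means are illustrative.

```python
# Hypothetical sketch: cluster graphs by motif densities (edge, wedge,
# triangle), the graph "moments" used to separate mixture components.
# Names and the use of k-means are illustrative, not the paper's code.
import networkx as nx
import numpy as np
from sklearn.cluster import KMeans

def motif_densities(G):
    """Edge, wedge, and triangle densities of a simple graph G."""
    n = G.number_of_nodes()
    edge_d = G.number_of_edges() / (n * (n - 1) / 2)
    tri = sum(nx.triangles(G).values()) / 3
    tri_d = tri / (n * (n - 1) * (n - 2) / 6)
    wedges = sum(d * (d - 1) / 2 for _, d in G.degree())
    wedge_d = wedges / (n * (n - 1) * (n - 2) / 2)
    return np.array([edge_d, wedge_d, tri_d])

# Two toy "populations": sparse vs. dense Erdos-Renyi graphs
# (each Erdos-Renyi model is a constant graphon).
graphs = [nx.erdos_renyi_graph(60, 0.1, seed=i) for i in range(20)] + \
         [nx.erdos_renyi_graph(60, 0.5, seed=i) for i in range(20)]
X = np.stack([motif_densities(G) for G in graphs])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```

Because graphs from the same graphon concentrate around the same motif densities, the two populations separate cleanly in moment space even though no labels are used.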

Abstract

Real-world graph datasets often arise from mixtures of populations, where graphs are generated by multiple distinct underlying distributions. In this work, we propose a unified framework that explicitly models graph data as a mixture of probabilistic graph generative models represented by graphons. To characterize and estimate these graphons, we leverage graph moments (motif densities) to cluster graphs generated from the same underlying model. We establish a novel theoretical guarantee, deriving a tighter bound showing that graphs sampled from structurally similar graphons exhibit similar motif densities with high probability. This result enables principled estimation of graphon mixture components. We show how incorporating estimated graphon mixture components enhances two widely used downstream paradigms: graph data augmentation via mixup and graph contrastive learning. By conditioning these methods on the underlying generative models, we develop graphon-mixture-aware mixup (GMAM) and model-aware graph contrastive learning (MGCL). Extensive experiments on both simulated and real-world datasets demonstrate strong empirical performance. In supervised learning, GMAM outperforms existing augmentation strategies, achieving new state-of-the-art accuracy on 6 out of 7 datasets. In unsupervised learning, MGCL performs competitively across seven benchmark datasets and achieves the lowest average rank overall.
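To make the mixup idea concrete, here is a minimal sketch of graphon-level mixup under simplifying assumptions: two mixture components are represented as step-function graphons (block probability matrices), a convex combination of them defines the mixed graphon, and a synthetic graph is sampled from it. This is not the paper's exact GMAM procedure; the helper names and block graphons are hypothetical.

```python
# Hypothetical sketch of graphon-level mixup (not the exact GMAM
# procedure): interpolate two step-function graphons estimated for two
# mixture components, then sample a synthetic graph from the mixture.
import numpy as np

def sample_from_graphon(W, n, rng):
    """Sample an n-node simple graph (adjacency matrix) from a step-function graphon W."""
    k = W.shape[0]
    u = rng.integers(0, k, size=n)       # latent block assignment per node
    P = W[np.ix_(u, u)]                  # pairwise edge probabilities
    A = (rng.random((n, n)) < P).astype(int)
    A = np.triu(A, 1)                    # keep upper triangle, drop self-loops
    return A + A.T                       # symmetrize

rng = np.random.default_rng(1)
W1 = np.array([[0.9, 0.1], [0.1, 0.9]])  # assortative component (toy)
W2 = np.array([[0.1, 0.9], [0.9, 0.1]])  # disassortative component (toy)
lam = 0.3                                # mixup coefficient
W_mix = lam * W1 + (1 - lam) * W2        # convex combination of graphons
A = sample_from_graphon(W_mix, 50, rng)
```

Mixing at the graphon level, rather than interpolating raw adjacency matrices of differently sized graphs, is what lets the augmentation respect the inferred generative models.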