A Scalable Nyström-Based Kernel Two-Sample Test with Permutations

arXiv stat.ML / 4/21/2026


Key Points

  • The paper addresses two-sample hypothesis testing and proposes a scalable approach for determining whether two datasets come from the same distribution.
  • It improves the practicality of maximum mean discrepancy (MMD)-based testing in large-scale settings by using a Nyström approximation of the MMD (see the sketch after this list).
  • The authors provide finite-sample theoretical guarantees, including a bound on the test’s power when the two distributions are sufficiently separated in MMD.
  • They show that the derived separation rate achieves the known minimax-optimal rate for this problem setting.
  • Numerical experiments demonstrate the method’s applicability to realistic scientific data and highlight its computational efficiency.
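
To make the approximation concrete, the sketch below (Python, not the authors' code) computes a Nyström-approximated squared MMD: a small set of landmark points defines an approximate finite-dimensional feature map, and the statistic is the squared distance between the two sample means in that feature space. The Gaussian kernel, the bandwidth, the eigenvalue cutoff, and all function names are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist


def gaussian_kernel(A, B, bandwidth):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq_dists = cdist(A, B, "sqeuclidean")
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))


def nystrom_features(Z, landmarks, bandwidth, eps=1e-10):
    """Map points Z to approximate kernel feature vectors via the Nystrom method.

    With K_mm = U diag(lambda) U^T the landmark kernel matrix, the feature map is
    phi(z) = diag(lambda)^{-1/2} U^T k(landmarks, z). For clarity the landmark
    decomposition is recomputed on every call; a real implementation would cache it.
    """
    K_mm = gaussian_kernel(landmarks, landmarks, bandwidth)
    eigvals, eigvecs = eigh(K_mm)
    keep = eigvals > eps                       # drop near-zero directions for stability
    W = eigvecs[:, keep] / np.sqrt(eigvals[keep])
    K_zm = gaussian_kernel(Z, landmarks, bandwidth)
    return K_zm @ W                            # shape (len(Z), effective rank)


def nystrom_mmd2(X, Y, landmarks, bandwidth):
    """Squared MMD between samples X and Y, computed in the Nystrom feature space."""
    phi_X = nystrom_features(X, landmarks, bandwidth)
    phi_Y = nystrom_features(Y, landmarks, bandwidth)
    diff = phi_X.mean(axis=0) - phi_Y.mean(axis=0)
    return float(diff @ diff)
```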

Abstract

Two-sample hypothesis testing, determining whether two sets of data are drawn from the same distribution, is a fundamental problem in statistics and machine learning with broad scientific applications. In the context of nonparametric testing, maximum mean discrepancy (MMD) has gained popularity as a test statistic due to its flexibility and strong theoretical foundations. However, its use in large-scale scenarios is plagued by high computational costs. In this work, we use a Nyström approximation of the MMD to design a computationally efficient and practical testing algorithm while preserving statistical guarantees. Our main result is a finite-sample bound on the power of the proposed test for distributions that are sufficiently separated with respect to the MMD. The derived separation rate matches the known minimax-optimal rate in this setting. We support our findings with a series of numerical experiments, emphasizing applicability to realistic scientific data.
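
The "with Permutations" part of the title refers to how the test is calibrated: the statistic is recomputed on random relabelings of the pooled sample to estimate a null threshold. The sketch below reuses nystrom_mmd2 from the block above; the permutation count, significance level, landmark selection, and bandwidth are illustrative choices, not the paper's prescription.

```python
import numpy as np


def permutation_test(X, Y, landmarks, bandwidth, n_permutations=200, alpha=0.05, seed=0):
    """Calibrate the Nystrom MMD statistic by permuting the pooled sample.

    Returns (reject, p_value): reject the null of equal distributions when the
    observed statistic is large relative to the permutation null distribution.
    """
    rng = np.random.default_rng(seed)
    observed = nystrom_mmd2(X, Y, landmarks, bandwidth)
    pooled = np.vstack([X, Y])
    n = len(X)
    null_stats = []
    for _ in range(n_permutations):
        perm = rng.permutation(len(pooled))
        X_perm, Y_perm = pooled[perm[:n]], pooled[perm[n:]]
        null_stats.append(nystrom_mmd2(X_perm, Y_perm, landmarks, bandwidth))
    # Standard +1 correction gives a valid permutation p-value.
    p_value = (1 + sum(s >= observed for s in null_stats)) / (1 + n_permutations)
    return p_value <= alpha, p_value


if __name__ == "__main__":
    # Toy usage: two Gaussian samples with shifted means, landmarks drawn from the pool.
    rng = np.random.default_rng(1)
    X = rng.normal(0.0, 1.0, size=(500, 2))
    Y = rng.normal(0.5, 1.0, size=(500, 2))
    landmarks = np.vstack([X, Y])[rng.choice(1000, size=50, replace=False)]
    reject, p = permutation_test(X, Y, landmarks, bandwidth=1.0)
    print(reject, p)
```

With m landmarks and n samples per group, each statistic evaluation costs on the order of n·m kernel evaluations plus an m×m eigendecomposition, compared with the O(n²) cost of the exact quadratic-time MMD; this gap is what makes the Nyström variant attractive at scale.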