High-dimensional Semi-supervised Classification via the Fermat Distance

arXiv stat.ML · April 28, 2026


Key Points

  • The paper studies semi-supervised classification in high-dimensional settings, assuming data lie near low-dimensional manifolds and form clusters, with few labeled but abundant unlabeled samples.
  • It introduces Fermat-distance-based methods, including a weighted k-nearest neighbors classifier and classifiers induced by multidimensional scaling (MDS), where increasing the target dimension enables effective use of linear classifiers on manifold data.
  • The authors provide theoretical guarantees, deriving sharp lower bounds on expected excess risk within clusters and proving that the weighted k-NN classifier using the true Fermat distance is minimax optimal.
  • They quantify how unlabeled data improves performance: the error from estimating the Fermat distance decreases exponentially with the pooled sample size, which is faster than previously reported rates.
  • Experiments on both synthetic and real datasets show that the proposed approaches are competitive with or outperform leading graph-based semi-supervised classifiers.
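To make the first two key points concrete, here is a minimal sketch of a Fermat-distance-based weighted k-NN classifier. The sample Fermat distance between two points is the shortest-path length over the point cloud when each edge is weighted by the Euclidean distance raised to a power p ≥ 1, which favors paths through dense regions. This is an illustrative implementation, not the authors' code; the helper names, the choice p=2, and the 1/distance voting weights are our assumptions.

```python
import numpy as np

def sample_fermat_distances(X, p=2.0):
    """All-pairs sample Fermat distances on the complete graph over X.

    Edge weights are Euclidean distances raised to the power p; the
    Fermat distance is the shortest-path length under these weights.
    Uses Floyd-Warshall, which is fine for small point clouds.
    """
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(-1)) ** p  # powered edge weights
    n = len(X)
    for m in range(n):  # relax paths through intermediate point m
        D = np.minimum(D, D[:, m:m + 1] + D[m:m + 1, :])
    return D

def weighted_knn_predict(D, labeled_idx, y_labeled, query_idx, k=3):
    """Weighted k-NN from a precomputed distance matrix D.

    Neighbors vote with weight 1/(distance + eps); vote ties are
    broken toward the smaller class label.
    """
    eps = 1e-12
    preds = []
    for q in query_idx:
        d = D[q, labeled_idx]
        nn = np.argsort(d)[:k]
        votes = {}
        for j in nn:
            c = y_labeled[j]
            votes[c] = votes.get(c, 0.0) + 1.0 / (d[j] + eps)
        preds.append(min(votes, key=lambda c: (-votes[c], c)))
    return np.array(preds)
```

Because unlabeled points also enter the graph, they shape the shortest paths and hence the metric, which is how the pooled (labeled plus unlabeled) sample improves the distance estimate.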

Abstract

Semi-supervised classification, where unlabeled data are massive but labeled data are limited, often arises in machine learning applications. We address this challenge for high-dimensional data by leveraging the manifold and cluster assumptions. Based on the Fermat distance, a density-sensitive metric that naturally encodes the cluster assumption, we propose the weighted k-nearest neighbors (k-NN) classifier and multidimensional scaling (MDS)-induced classifiers. The use of MDS with a large target dimension allows the effective application of linear classifiers to complex manifold data. Theoretically, we derive a sharp lower bound for the expected excess risk within clusters and prove that the weighted k-NN classifier utilizing the true Fermat distance is minimax optimal. Furthermore, we explicitly quantify the utility of unlabeled data by showing that the error arising from estimating the Fermat distance decays exponentially with the pooled sample size. Such a rate is much faster than the related rates in the literature. Extensive experiments on synthetic and real datasets demonstrate competitive or superior performance of our approaches compared to state-of-the-art graph-based semi-supervised classifiers.
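The MDS-induced pipeline described in the abstract can be sketched as follows: embed all points (labeled and unlabeled) from a precomputed dissimilarity matrix, such as a Fermat distance matrix, then fit a linear rule on the labeled embeddings. This is a sketch under assumptions, not the paper's exact estimator: we use classical MDS via eigendecomposition of the double-centered squared-distance matrix and a ridge-regularized least-squares linear classifier.

```python
import numpy as np

def classical_mds(D, dim):
    """Classical MDS: embed points so Euclidean distances approximate D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    B = -0.5 * J @ (D ** 2) @ J               # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)            # ascending eigenvalues
    order = np.argsort(vals)[::-1][:dim]      # keep top `dim` directions
    top = np.clip(vals[order], 0.0, None)     # drop tiny negative eigenvalues
    return vecs[:, order] * np.sqrt(top)

def linear_fit(Z, y):
    """Ridge-regularized least squares on +/-1 targets, with a bias term."""
    Zb = np.hstack([Z, np.ones((Z.shape[0], 1))])
    t = np.where(y == 1, 1.0, -1.0)
    return np.linalg.solve(Zb.T @ Zb + 1e-6 * np.eye(Zb.shape[1]), Zb.T @ t)

def linear_predict(Z, w):
    Zb = np.hstack([Z, np.ones((Z.shape[0], 1))])
    return (Zb @ w >= 0).astype(int)
```

The semi-supervised step is that the embedding is computed from the pooled sample, so the linear classifier trained on the few labeled points operates in coordinates shaped by all of the data; a large target dimension preserves more of the metric structure, which is what lets linear rules handle manifold data.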
