High-dimensional Semi-supervised Classification via the Fermat Distance

arXiv stat.ML · April 28, 2026


Key Points

  • The paper studies semi-supervised classification in high-dimensional settings, assuming data lie near low-dimensional manifolds and form clusters, with few labeled but abundant unlabeled samples.
  • It introduces Fermat-distance-based methods, including a weighted k-nearest neighbors classifier and classifiers induced by multidimensional scaling (MDS), where increasing the target dimension enables effective use of linear classifiers on manifold data.
  • The authors provide theoretical guarantees, deriving sharp lower bounds on expected excess risk within clusters and proving that the weighted k-NN classifier using the true Fermat distance is minimax optimal.
  • They quantify how unlabeled data improves performance: the error from estimating the Fermat distance decreases exponentially with the pooled sample size, which is faster than previously reported rates.
  • Experiments on both synthetic and real datasets show that the proposed approaches are competitive with or outperform leading graph-based semi-supervised classifiers.
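To make the first two key points concrete, here is a minimal sketch of a Fermat-distance-based weighted k-NN classifier. The sample Fermat distance between two points is the shortest-path length over the point cloud when each edge is weighted by the Euclidean distance raised to a power p ≥ 1, which favors paths through dense regions. This is an illustrative implementation, not the authors' code; the helper names, the choice p=2, and the 1/distance voting weights are our assumptions.

```python
import numpy as np

def sample_fermat_distances(X, p=2.0):
    """All-pairs sample Fermat distances on the complete graph over X.

    Edge weights are Euclidean distances raised to the power p; the
    Fermat distance is the shortest-path length under these weights.
    Uses Floyd-Warshall, which is fine for small point clouds.
    """
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(-1)) ** p  # powered edge weights
    n = len(X)
    for m in range(n):  # relax paths through intermediate point m
        D = np.minimum(D, D[:, m:m + 1] + D[m:m + 1, :])
    return D

def weighted_knn_predict(D, labeled_idx, y_labeled, query_idx, k=3):
    """Weighted k-NN from a precomputed distance matrix D.

    Neighbors vote with weight 1/(distance + eps); vote ties are
    broken toward the smaller class label.
    """
    eps = 1e-12
    preds = []
    for q in query_idx:
        d = D[q, labeled_idx]
        nn = np.argsort(d)[:k]
        votes = {}
        for j in nn:
            c = y_labeled[j]
            votes[c] = votes.get(c, 0.0) + 1.0 / (d[j] + eps)
        preds.append(min(votes, key=lambda c: (-votes[c], c)))
    return np.array(preds)
```

Because unlabeled points also enter the graph, they shape the shortest paths and hence the metric, which is how the pooled (labeled plus unlabeled) sample improves the distance estimate.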

Abstract

Semi-supervised classification, where unlabeled data are massive but labeled data are limited, often arises in machine learning applications. We address this challenge for high-dimensional data by leveraging the manifold and cluster assumptions. Based on the Fermat distance, a density-sensitive metric that naturally encodes the cluster assumption, we propose the weighted k-nearest neighbors (k-NN) classifier and multidimensional scaling (MDS)-induced classifiers. The use of MDS with a large target dimension allows the effective application of linear classifiers to complex manifold data. Theoretically, we derive a sharp lower bound for the expected excess risk within clusters and prove that the weighted k-NN classifier utilizing the true Fermat distance is minimax optimal. Furthermore, we explicitly quantify the utility of unlabeled data by showing that the error arising from estimating the Fermat distance decays exponentially with the pooled sample size. Such a rate is much faster than the related rates in the literature. Extensive experiments on synthetic and real datasets demonstrate competitive or superior performance of our approaches compared to state-of-the-art graph-based semi-supervised classifiers.
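The MDS-induced pipeline described in the abstract can be sketched as follows: embed all points (labeled and unlabeled) from a precomputed dissimilarity matrix, such as a Fermat distance matrix, then fit a linear rule on the labeled embeddings. This is a sketch under assumptions, not the paper's exact estimator: we use classical MDS via eigendecomposition of the double-centered squared-distance matrix and a ridge-regularized least-squares linear classifier.

```python
import numpy as np

def classical_mds(D, dim):
    """Classical MDS: embed points so Euclidean distances approximate D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    B = -0.5 * J @ (D ** 2) @ J               # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)            # ascending eigenvalues
    order = np.argsort(vals)[::-1][:dim]      # keep top `dim` directions
    top = np.clip(vals[order], 0.0, None)     # drop tiny negative eigenvalues
    return vecs[:, order] * np.sqrt(top)

def linear_fit(Z, y):
    """Ridge-regularized least squares on +/-1 targets, with a bias term."""
    Zb = np.hstack([Z, np.ones((Z.shape[0], 1))])
    t = np.where(y == 1, 1.0, -1.0)
    return np.linalg.solve(Zb.T @ Zb + 1e-6 * np.eye(Zb.shape[1]), Zb.T @ t)

def linear_predict(Z, w):
    Zb = np.hstack([Z, np.ones((Z.shape[0], 1))])
    return (Zb @ w >= 0).astype(int)
```

The semi-supervised step is that the embedding is computed from the pooled sample, so the linear classifier trained on the few labeled points operates in coordinates shaped by all of the data; a large target dimension preserves more of the metric structure, which is what lets linear rules handle manifold data.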
