DiRe-RAPIDS: Topology-faithful dimensionality reduction at scale

arXiv cs.LG / 4/29/2026


Key Points

  • The paper argues that common dimensionality-reduction methods like UMAP and t-SNE can optimize for local neighborhoods in ways that preserve sampling noise while distorting the data’s global topology.
  • It reports that top-performing embeddings can “memorize” noise, producing artificial features such as cycles and disconnected islands that are not present in the original data.
  • The authors introduce a topology-faithfulness benchmark using noisy manifolds with known homology, and use it to tune DiRe for better global-topology preservation.
  • Experiments show DiRe can match or outperform GPU-accelerated UMAP on classification tasks while also recovering exact first Betti numbers on topology stress tests.
  • On a large-scale test of 723K arXiv paper embeddings, DiRe is claimed to preserve 3–4× more topological structure than UMAP at comparable wall-clock time.
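The failure mode in the first two points can be made concrete with a toy experiment. The sketch below (an illustration of the idea, not the paper's actual benchmark code; all names and parameter choices are my own) scores a deliberately broken "embedding" that tears a circle into two distant arcs: a purely local k-NN preservation metric stays high, while the zeroth Betti number (count of connected components) of an ε-neighborhood graph reveals the topological damage.

```python
# Toy illustration (assumption: NOT the paper's benchmark code) of how a
# local neighborhood metric can look good while global topology is broken.
import numpy as np

n, k, eps = 200, 10, 0.15
theta = np.linspace(0, 2 * np.pi, n, endpoint=False)
X = np.c_[np.cos(theta), np.sin(theta)]      # points on a circle: beta_0 = 1

# Fake "embedding": rigidly translate half the circle far away,
# tearing one loop into two disconnected arcs (beta_0 = 2).
Y = X.copy()
Y[theta < np.pi] += np.array([5.0, 0.0])

def knn(P, k):
    """Indices of each point's k nearest neighbours (self excluded)."""
    D = np.linalg.norm(P[:, None] - P[None, :], axis=-1)
    return np.argsort(D, axis=1)[:, 1:k + 1]

def components(P, eps):
    """Connected components of the eps-neighborhood graph (= beta_0)."""
    adj = np.linalg.norm(P[:, None] - P[None, :], axis=-1) < eps
    seen, comps = np.zeros(len(P), bool), 0
    for s in range(len(P)):
        if not seen[s]:
            comps += 1
            stack = [s]
            while stack:
                i = stack.pop()
                if not seen[i]:
                    seen[i] = True
                    stack.extend(np.flatnonzero(adj[i] & ~seen))
    return comps

# Local metric: mean fraction of k nearest neighbours preserved.
overlap = np.mean([len(np.intersect1d(a, b)) / k
                   for a, b in zip(knn(X, k), knn(Y, k))])

print(f"kNN overlap:       {overlap:.2f}")   # high: "looks" well preserved
print("components before:", components(X, eps))
print("components after: ", components(Y, eps))
```

Only the few points adjacent to the two cuts lose any neighbors, so the local score stays above 0.9 even though the embedding has invented a disconnected island, which is exactly the kind of error the paper's homology-based benchmark is designed to catch (the real benchmark also checks first Betti numbers, which this β₀-only sketch omits).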

Abstract

Dimensionality reduction methods such as UMAP and t-SNE are central tools for visualising high-dimensional data, but their local-neighborhood objectives can preserve sampling noise while distorting global topology. We show that standard local metrics reward this noise memorisation: top-performing embeddings invent cycles and disconnected islands absent from the data. We introduce a topology-faithfulness benchmark based on noisy manifolds with known homology, tune DiRe against it, and find Pareto-optimal configurations that match or beat GPU-accelerated UMAP on classification while recovering exact first Betti numbers on stress tests. On 723K arXiv paper embeddings, DiRe preserves 3–4× more topological structure than UMAP at comparable wall-clock time.