Assessing the impact of dimensionality reduction on clustering performance -- a systematic study

arXiv cs.LG / 4/27/2026

💬 Opinion · Models & Research

Key Points

  • The paper conducts a systematic evaluation of how five dimensionality reduction methods (PCA, Kernel PCA, VAE, Isomap, and MDS) affect clustering performance on high-dimensional data.
  • It benchmarks four clustering algorithms (k-means, agglomerative hierarchical clustering, GMM, and OPTICS) using the Adjusted Rand Index (ARI) to compare results with and without dimensionality reduction.
  • The study tests the reduction levels suggested in prior literature (k−1 dimensions, where k is the number of clusters, and 25% and 50% of the original dimensionality) to measure how the aggressiveness of reduction affects outcomes; a minimal code sketch of this evaluation grid follows the list.
  • Results indicate that both the choice of dimensionality reduction technique and the reduction target level must be selected to match the data’s intrinsic geometry and the specific clustering algorithm.
  • The work highlights remaining gaps in comprehensive cross-method, cross-data-type assessment for dimensionality reduction in clustering pipelines.
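
The sketch below wires this evaluation grid together with scikit-learn. It is a minimal illustration, not the paper's code: the digits dataset is a stand-in, the subsample size and hyperparameters are assumptions, and the VAE branch is omitted because it requires a deep-learning framework.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA, KernelPCA
from sklearn.manifold import Isomap, MDS
from sklearn.cluster import KMeans, AgglomerativeClustering, OPTICS
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score

# Stand-in dataset: 64-dimensional digits, subsampled so Isomap/MDS stay fast.
X, y = load_digits(return_X_y=True)
X, y = X[:500], y[:500]
X = StandardScaler().fit_transform(X)
k = len(np.unique(y))  # number of ground-truth clusters

# Four of the five reducers from the paper; the VAE is omitted here.
reducers = {
    "PCA":        lambda d: PCA(n_components=d),
    "Kernel PCA": lambda d: KernelPCA(n_components=d, kernel="rbf"),
    "Isomap":     lambda d: Isomap(n_components=d),
    "MDS":        lambda d: MDS(n_components=d),
}

def cluster(algo, Z):
    """Fit one of the four clustering algorithms and return its labels."""
    if algo == "k-means":
        return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Z)
    if algo == "AHC":
        return AgglomerativeClustering(n_clusters=k).fit_predict(Z)
    if algo == "GMM":
        return GaussianMixture(n_components=k, random_state=0).fit_predict(Z)
    return OPTICS(min_samples=10).fit_predict(Z)  # density-based, no k needed

# Reduction levels from the paper: k-1, and 25% / 50% of the original dimensions.
levels = {"k-1": k - 1, "25%": X.shape[1] // 4, "50%": X.shape[1] // 2}

for algo in ["k-means", "AHC", "GMM", "OPTICS"]:
    print(f"{algo}: baseline ARI = {adjusted_rand_score(y, cluster(algo, X)):.3f}")
    for rname, make in reducers.items():
        for lname, d in levels.items():
            Z = make(d).fit_transform(X)
            print(f"  {rname} @ {lname} ({d} dims): "
                  f"ARI = {adjusted_rand_score(y, cluster(algo, Z)):.3f}")
```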

Abstract

Dimensionality reduction is a critical preprocessing step for clustering high-dimensional data, yet comprehensive evaluation of its impact across diverse methods and data types remains limited. In this study, we systematically assess the influence of five dimensionality reduction techniques - Principal Component Analysis (PCA), Kernel Principal Component Analysis (Kernel PCA), Variational Autoencoder (VAE), Isometric Mapping (Isomap), and Multidimensional Scaling (MDS) - on the performance of four popular clustering algorithms - k-means, Agglomerative Hierarchical Clustering (AHC), Gaussian Mixture Models (GMM), and Ordering Points to Identify the Clustering Structure (OPTICS). We evaluate clustering quality using the Adjusted Rand Index (ARI), comparing results with and without dimensionality reduction at the reduction levels recommended in the literature (i.e., k-1, where k is the number of clusters, and 25% and 50% of the original number of dimensions). Our findings underscore the importance of carefully selecting both the dimensionality reduction technique and the reduction level, tailored to the intrinsic geometry of the data and the clustering algorithm under consideration.
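
For readers unfamiliar with the metric, ARI scores the agreement between two partitions, corrected for chance: 1 for identical partitions (up to a relabeling of clusters), around 0 for uninformative or random assignments, and negative values for worse-than-chance agreement. A tiny self-contained check, with made-up labels for illustration:

```python
from sklearn.metrics import adjusted_rand_score

truth = [0, 0, 1, 1, 2, 2]
# ARI is invariant to how clusters are named: a relabeled copy scores 1.0.
print(adjusted_rand_score(truth, [2, 2, 0, 0, 1, 1]))  # 1.0
# Any genuinely different partition scores strictly below 1.0.
print(adjusted_rand_score(truth, [0, 0, 1, 2, 2, 2]))  # < 1.0
# A degenerate single-cluster labeling carries no information and scores 0.0.
print(adjusted_rand_score(truth, [0, 0, 0, 0, 0, 0]))  # 0.0
```

The chance correction is what makes ARI suitable for the paper's comparisons: raw agreement scores would inflate as the number of clusters or the imbalance between them changes across reduction levels.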