Variable Selection Using Relative Importance Rankings

arXiv stat.ML / April 14, 2026

Key Points

  • The paper reframes relative importance (RI) analysis—traditionally used for post-hoc model explanation—into a pre-model workflow for feature/variable ranking and filter-based selection.
  • It argues RI measures should outperform marginal correlation by capturing both direct and combined predictor effects, thereby accounting for dependencies among variables.
  • The authors introduce a new RI metric, CRI.Z, and show it improves computational efficiency versus conventional RI measures.
  • Extensive simulations indicate RI-based rankings are more accurate than marginal correlation, particularly in the presence of suppressor or weak predictors, and models trained on RI-selected variables are highly competitive with the lasso and relaxed lasso.
  • The method also performs well in difficult regimes with clusters of highly correlated predictors and is validated on two high-dimensional gene-expression datasets, with accompanying open-source code.
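The filter-based workflow described above — score each predictor, rank, keep the top k, then fit a model on the selected subset — can be sketched as follows. The paper's CRI.Z measure is not reproduced here; this sketch uses absolute marginal correlation as a placeholder scoring function, with synthetic data and the `top_k` cutoff chosen purely for illustration.

```python
import numpy as np

def filter_select(X, y, score_fn, top_k):
    """Generic filter-based selection: score each predictor, keep the top_k."""
    scores = np.array([score_fn(X[:, j], y) for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:top_k]  # indices of the highest-scoring predictors

def abs_marginal_corr(x, y):
    # Placeholder score; an RI measure such as the paper's CRI.Z would slot in here.
    return abs(np.corrcoef(x, y)[0, 1])

# Synthetic example: 10 predictors, only the first 3 carry signal.
rng = np.random.default_rng(0)
n, p = 500, 10
X = rng.standard_normal((n, p))
y = 2.0 * X[:, 0] + 1.5 * X[:, 1] + 1.0 * X[:, 2] + rng.standard_normal(n)

selected = filter_select(X, y, abs_marginal_corr, top_k=3)
print(sorted(selected.tolist()))  # expected to recover the informative predictors 0, 1, 2

# Fit an ordinary least-squares model on the selected subset only.
beta, *_ = np.linalg.lstsq(X[:, selected], y, rcond=None)
```

Swapping `abs_marginal_corr` for an RI-based score changes only the ranking step; the surrounding selection-then-fit pipeline stays the same.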

Abstract

Although conceptually related, variable selection and relative importance (RI) analysis have been treated quite differently in the literature. While RI is typically used for post-hoc model explanation, this paper explores its potential for variable or feature ranking and filter-based selection before model creation. Specifically, we anticipate strong performance from RI measures because they incorporate both the direct and combined effects of predictors, addressing a key limitation of marginal correlation, which ignores dependencies among predictors. We implement and evaluate RI-based variable ranking and selection methods, including a newly proposed RI measure, CRI.Z, which offers improved computational efficiency relative to conventional RI measures. Through extensive simulations, we first demonstrate that the RI measures rank variables more accurately than marginal correlation, especially in the presence of suppressor or weak predictors. We then show that predictive models built on these rankings are highly competitive, often outperforming state-of-the-art linear-model methods such as the lasso and relaxed lasso. The proposed RI-based methods are particularly effective in challenging cases involving clusters of highly correlated predictors, a setting known to cause failures in many benchmark methods. The practical utility and efficiency of RI-based methods are further demonstrated on two high-dimensional gene-expression datasets. Although lasso methods have dominated the recent literature on variable selection, our study reveals that the RI-based method is a powerful and competitive alternative. We believe these underutilized tools deserve greater attention in the statistics and machine learning communities. The code is available at: https://github.com/tien-endotchang/RI-variable-selection.
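The suppression scenario highlighted in the abstract is easy to illustrate: a suppressor can have near-zero marginal correlation with the response yet contribute substantially inside the model, because it cancels noise in another predictor. The sketch below uses the joint OLS coefficient as a simple model-based stand-in for an RI measure; the paper's actual RI measures, including CRI.Z, are more refined.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
signal = rng.standard_normal(n)        # true driver of y
shared_noise = rng.standard_normal(n)  # contaminates x1; x2 measures it directly

x1 = signal + shared_noise  # relevant but noisy predictor
x2 = shared_noise           # classic suppressor: marginally unrelated to y
y = signal + 0.1 * rng.standard_normal(n)

# Marginal screening: the suppressor x2 looks useless.
print(f"marginal corr(x2, y) = {np.corrcoef(x2, y)[0, 1]:.3f}")  # near zero

# Joint model y ~ x1 + x2: x2 receives a large negative coefficient,
# since y ≈ x1 - x2 by construction.
X = np.column_stack([x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"OLS coefficients = {beta.round(2)}")  # roughly [1, -1]
```

A marginal-correlation filter would discard x2 here, while any measure that accounts for joint effects ranks it highly — the failure mode the RI-based approach is designed to avoid.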