Linear Models, Variable Selection, Artificial Intelligence

arXiv stat.ML / 5/1/2026

Key Points

  • The paper reviews long-standing variable selection challenges in linear regression and contrasts common approaches like stepwise selection, AIC/BIC penalized likelihood, and coefficient-penalized methods such as LASSO and Elastic Net.
  • It proposes an AI-based model selection method in which an artificial neural network (ANN) is trained to assess variable significance from ordinary least squares (OLS) estimates.
  • Simulation experiments evaluate the method's accuracy across a range of sample sizes and variances.
  • Additional simulations benchmark the ANN approach against Forward/Backward selection, AIC, BIC, and LASSO.
  • The authors demonstrate the method on a World Health Organization life expectancy dataset and provide a GitHub link with a pretrained ANN supporting up to 100 predictors, along with the original and subset datasets.
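
To make the contrast between sequential and penalized-likelihood selection concrete, here is a minimal sketch of forward selection scored by BIC, using NumPy only. It is not the paper's code: the function names (`ols_rss`, `bic`, `forward_select_bic`), the synthetic data, and the Gaussian form of BIC (stated up to an additive constant) are assumptions made for illustration.

```python
import numpy as np

def ols_rss(X, y):
    """Fit OLS with an intercept and return the residual sum of squares."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def bic(X, y):
    """Gaussian BIC up to an additive constant: n*log(RSS/n) + k*log(n)."""
    n = len(y)
    k = X.shape[1] + 1  # slopes plus intercept
    return n * np.log(ols_rss(X, y) / n) + k * np.log(n)

def forward_select_bic(X, y):
    """Greedy forward selection: add the predictor that lowers BIC the most."""
    n, p = X.shape
    selected = []
    best = bic(np.empty((n, 0)), y)  # start from the intercept-only model
    while len(selected) < p:
        candidates = [j for j in range(p) if j not in selected]
        scores = {j: bic(X[:, selected + [j]], y) for j in candidates}
        j_best = min(scores, key=scores.get)
        if scores[j_best] >= best:   # no candidate improves BIC, so stop
            break
        selected.append(j_best)
        best = scores[j_best]
    return sorted(selected)

# Synthetic example (assumed setup): only columns 0 and 3 affect y.
rng = np.random.default_rng(0)
n, p = 200, 6
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(size=n)
selected = forward_select_bic(X, y)
print(selected)
```

With signals this strong, columns 0 and 3 should be picked up. Whether a spurious column also enters depends on the noise draw, which is the kind of behavior the paper's benchmark simulations are meant to quantify.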

Abstract

Variable selection in linear regression models has been a problem for as long as hypothesis testing has existed. Deciding which variables to include in or exclude from a model is not an easy task. Techniques such as Forward, Backward, and Stepwise Regression sequentially add or delete variables from a model. Penalized likelihood methods such as AIC and BIC seek to choose variables that contribute significantly to the likelihood. Penalized sum-of-squares methods such as LASSO and Elastic Net penalize coefficient magnitude so that only variables with large coefficients remain in the model. This work introduces an Artificial Intelligence approach to model selection in which an artificial neural network (ANN) is trained to determine the significance of the variables based on ordinary least squares (OLS) estimates. A simulation study shows the accuracy across various sample sizes and variances. A second simulation study compares the performance of the approach against Forward, Backward, AIC, BIC, and LASSO. The approach is illustrated using a World Health Organization (WHO) dataset on life expectancy. A GitHub link is provided to the pretrained ANN, which can handle up to 100 predictor variables, along with the original WHO dataset and the subset used in this work.
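
The abstract does not describe the ANN's inputs or architecture, so the following is only a sketch of how simulated training data for such a classifier could be assembled from OLS output. The choice of t-statistics as features, the helper names (`ols_summary`, `simulate_pair`), and all simulation settings are assumptions, not the authors' design.

```python
import numpy as np

def ols_summary(X, y):
    """Return OLS slope estimates, standard errors, and t-statistics."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])          # add intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    dof = n - A.shape[1]
    sigma2 = resid @ resid / dof                  # unbiased error variance
    cov = sigma2 * np.linalg.inv(A.T @ A)
    se = np.sqrt(np.diag(cov))
    return beta[1:], se[1:], beta[1:] / se[1:]    # drop the intercept

def simulate_pair(rng, n=100, p=5, sigma=1.0, prob_active=0.5):
    """One simulated (features, labels) pair for a hypothetical classifier."""
    active = rng.random(p) < prob_active          # which predictors truly matter
    true_beta = np.where(active, rng.uniform(1.0, 3.0, size=p), 0.0)
    X = rng.normal(size=(n, p))
    y = X @ true_beta + rng.normal(scale=sigma, size=n)
    _, _, t = ols_summary(X, y)
    return t, active.astype(int)

rng = np.random.default_rng(42)
pairs = [simulate_pair(rng) for _ in range(200)]
features = np.stack([f for f, _ in pairs])        # one row of t-statistics per dataset
labels = np.stack([l for _, l in pairs])          # 1 = truly active, 0 = inactive
```

Arrays like `features` and `labels` could then be passed to any standard classifier. Varying `n` and `sigma` in the simulation mirrors the sample-size and variance conditions the abstract mentions.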