An Interdisciplinary and Cross-Task Review on Missing Data Imputation

arXiv stat.ML / 4/27/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • Missing data remains a major obstacle to analysis and decision-making across many domains, and the current research landscape is fragmented across fields and methods.
  • The review bridges statistical foundations with modern machine learning by systematically covering missingness mechanisms, single vs. multiple imputation, imputation goals, and domain-specific problem characteristics.
  • It categorizes imputation approaches from classical methods (e.g., regression, EM) to modern techniques including low/high-rank matrix completion, deep learning models (autoencoders, GANs, diffusion models, graph neural networks), and even large language model-based methods.
  • Special emphasis is placed on handling complex data types (tensors, time series, streaming, graphs, categorical, and multimodal data) and on how imputation should integrate with downstream tasks like classification, clustering, and anomaly detection.
  • The article also evaluates theoretical guarantees, benchmarks, and metrics, and highlights future challenges such as model/hyperparameter selection, privacy-preserving imputation via federated learning, and developing generalizable models across domains and data types.
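The contrast between single and multiple imputation in the second point can be made concrete with a toy sketch. This is illustrative only and not taken from the review: mean imputation stands in for single imputation, and drawing missing values from a normal fit to the observed data is a deliberate simplification of proper multiple-imputation procedures.

```python
import random
import statistics

def mean_impute(column):
    """Single imputation: replace each missing entry (None) with the column mean.

    Fast and simple, but every missing cell gets the same value, so the
    imputed column understates the variance of the true data.
    """
    mean = statistics.mean(x for x in column if x is not None)
    return [mean if x is None else x for x in column]

def multiple_impute(column, m=5, seed=0):
    """Multiple imputation (toy version): draw each missing entry from a
    normal distribution fitted to the observed values, producing m completed
    datasets whose spread reflects uncertainty about the missing cells.
    """
    rng = random.Random(seed)
    observed = [x for x in column if x is not None]
    mu = statistics.mean(observed)
    sigma = statistics.stdev(observed)
    return [
        [rng.gauss(mu, sigma) if x is None else x for x in column]
        for _ in range(m)
    ]

col = [2.0, None, 4.0, 6.0, None, 8.0]
print(mean_impute(col))            # both gaps filled with the mean, 5.0
print(len(multiple_impute(col)))   # 5 differently completed copies
```

Downstream analyses would be run on each of the `m` completed datasets and the results pooled, which is what distinguishes multiple imputation from simply running a stochastic imputer once.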

Abstract

Missing data is a fundamental challenge in data science, significantly hindering analysis and decision-making across a wide range of disciplines, including healthcare, bioinformatics, social science, e-commerce, and industrial monitoring. Despite decades of research and numerous imputation methods, the literature remains fragmented across fields, creating a critical need for a comprehensive synthesis that connects statistical foundations with modern machine learning advances. This work systematically reviews core concepts, including missingness mechanisms, single versus multiple imputation, and different imputation goals, and examines problem characteristics across various domains. It provides a thorough categorization of imputation methods, ranging from classical techniques (e.g., regression, the EM algorithm) to modern approaches like low-rank and high-rank matrix completion, deep learning models (autoencoders, GANs, diffusion models, graph neural networks), and large language models. Special attention is given to methods for complex data types, such as tensors, time series, streaming data, graph-structured data, categorical data, and multimodal data. Beyond methodology, we investigate the crucial integration of imputation with downstream tasks like classification, clustering, and anomaly detection, examining both sequential pipelines and joint optimization frameworks. The review also assesses theoretical guarantees, benchmarking resources, and evaluation metrics. Finally, we identify critical challenges and future directions, emphasizing model selection and hyperparameter optimization, the growing importance of privacy-preserving imputation via federated learning, and the pursuit of generalizable models that can adapt across domains and data types, thereby outlining a roadmap for future research.
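Among the modern approaches the abstract lists, low-rank matrix completion is perhaps the easiest to sketch. The following is a rough illustration (not the review's method): a rank-1 alternating-least-squares completion in plain Python, which fits the observed entries of a matrix as an outer product u·vᵀ and fills each missing cell (marked `None`) with the fitted product uᵢ·vⱼ.

```python
def rank1_complete(M, iters=200):
    """Toy low-rank matrix completion: fit M ~ u v^T on the observed entries
    (None = missing) by alternating least squares, then fill each missing
    entry with u[i] * v[j]. Real methods use higher ranks and regularization.
    """
    rows, cols = len(M), len(M[0])
    u = [1.0] * rows
    v = [1.0] * cols
    for _ in range(iters):
        # Fix v, solve each u[i] by 1-D least squares over that row's observed cells.
        for i in range(rows):
            num = sum(v[j] * M[i][j] for j in range(cols) if M[i][j] is not None)
            den = sum(v[j] ** 2 for j in range(cols) if M[i][j] is not None)
            if den > 0:
                u[i] = num / den
        # Fix u, solve each v[j] symmetrically over that column's observed cells.
        for j in range(cols):
            num = sum(u[i] * M[i][j] for i in range(rows) if M[i][j] is not None)
            den = sum(u[i] ** 2 for i in range(rows) if M[i][j] is not None)
            if den > 0:
                v[j] = num / den
    return [[M[i][j] if M[i][j] is not None else u[i] * v[j]
             for j in range(cols)] for i in range(rows)]

# A genuinely rank-1 matrix (outer product of [1, 2, 3] and [1, 2]) with two
# entries hidden; the fills converge toward the true values 4.0 and 3.0.
M = [[1.0, 2.0],
     [2.0, None],
     [None, 6.0]]
completed = rank1_complete(M)
```

The low-rank assumption is what makes recovery possible at all: with enough observed entries spread across rows and columns, the factors are pinned down and the missing cells follow. High-rank and deep-learning imputers relax this assumption in different ways.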