Identifiable Deep Latent Variable Models for MNAR Data

arXiv stat.ML / 3/27/2026

💬 OpinionIdeas & Deep AnalysisModels & Research

共有:

Key Points

The paper addresses biased results in missing-not-at-random (MNAR) settings, pointing out that many deep-learning imputation approaches assume missing-at-random (MAR) and can fail when this assumption is violated.
It introduces a framework using deep latent variable models and proves identifiability of the underlying data distribution under a “conditional no self-censoring” assumption given latent variables.
The authors develop an estimation method based on importance-weighted autoencoders to learn unknown parameters efficiently.
The work provides theoretical and empirical evidence that the proposed approach can recover the true joint distribution, subject to specified regularity conditions.
Extensive simulations and real-world experiments show improved performance over classical and contemporary MNAR imputation baselines.

Abstract

Missing data is a ubiquitous challenge in data analysis, often leading to biased and inaccurate results. Traditional imputation methods usually assume that the missingness mechanism is missing-at-random (MAR), where the missingness is independent of the missing values themselves. This assumption is frequently violated in real-world scenarios, prompted by recent advances in imputation methods using deep learning to address this challenge. However, these methods neglect the crucial issue of nonparametric identifiability in missing-not-at-random (MNAR) data, which can lead to biased and unreliable results. This paper seeks to bridge this gap by proposing a novel framework based on deep latent variable models for {MNAR data}. Building on the assumption of conditional no self-censoring {given} latent variables, we establish the identifiability of the data distribution. This crucial theoretical result guarantees the feasibility of our approach. To effectively estimate unknown parameters, we develop an efficient algorithm utilizing importance-weighted autoencoders. We demonstrate, both theoretically and empirically, that our estimation process accurately recovers the ground-truth joint distribution under specific regularity conditions. Extensive simulation studies and real-world data experiments showcase the advantages of our proposed method compared to various classical and state-of-the-art approaches to missing data imputation.