Hierarchical Probabilistic Principal Component Analysis of Longitudinal Data

arXiv stat.ML / 4/27/2026

Key Points

  • The paper argues that existing methods such as probabilistic PCA (PPCA) are not well suited to longitudinal datasets that are both high-dimensional and substantially incomplete, because they ignore nested sources of variation and temporal dependency.
  • It proposes hierarchical probabilistic principal component analysis (HPPCA), a two-level probabilistic factor model that separates between-subject variability from time-varying within-subject dynamics.
  • HPPCA models within-subject latent factors using a Gaussian process and introduces an EM algorithm designed to handle missing data and flexible covariance kernels efficiently.
  • Simulation results show HPPCA substantially improves imputation accuracy over standard PPCA and multivariate functional PCA, even with heavy missingness and when the model is misspecified.
  • In a long COVID symptoms application, HPPCA captures hierarchical structure effectively and improves prediction of clinical outcomes and the recovery of masked clinical records compared with existing methods.
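
To make the two-level structure in the points above concrete, here is a minimal simulation sketch in Python. It is not the paper's specification: the additive form, the squared-exponential kernel, the loading matrices `W_between` and `W_within`, and all default values are assumptions chosen for illustration. Each subject gets one time-constant latent vector (between-subject variability) plus latent trajectories drawn from a Gaussian process (within-subject dynamics), and entries are then hidden at random.

```python
import numpy as np

def rbf_kernel(times, length_scale=1.0):
    """Squared-exponential covariance between all pairs of time points."""
    diff = times[:, None] - times[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def simulate_two_level(n_subjects=20, n_times=8, n_vars=12,
                       q_between=2, q_within=2, noise_sd=0.1,
                       length_scale=2.0, missing_rate=0.3, seed=0):
    """Draw data from an assumed two-level factor model with GP dynamics."""
    rng = np.random.default_rng(seed)
    times = np.linspace(0.0, 10.0, n_times)

    # Hypothetical loading matrices, one per level of the hierarchy.
    W_between = rng.normal(size=(n_vars, q_between))
    W_within = rng.normal(size=(n_vars, q_within))

    # Cholesky factor of the temporal kernel; jitter keeps it positive definite.
    K = rbf_kernel(times, length_scale) + 1e-6 * np.eye(n_times)
    L = np.linalg.cholesky(K)

    Y = np.empty((n_subjects, n_times, n_vars))
    for i in range(n_subjects):
        z = rng.normal(size=q_between)                # constant over time for subject i
        u = L @ rng.normal(size=(n_times, q_within))  # one GP path per within-subject factor
        eps = noise_sd * rng.normal(size=(n_times, n_vars))
        Y[i] = W_between @ z + u @ W_within.T + eps   # between + within + noise

    # Hide entries completely at random (an assumed missingness mechanism).
    mask = rng.random(Y.shape) < missing_rate
    Y_obs = np.where(mask, np.nan, Y)
    return times, Y, Y_obs, mask
```

Calling `simulate_two_level()` returns the time grid, the complete array of shape `(n_subjects, n_times, n_vars)`, a copy with `NaN` at hidden entries, and the boolean mask, which is the kind of input an imputation method would be scored on.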

Abstract

In many longitudinal studies, a large number of variables are measured repeatedly over time, with substantial missing data. Existing methods, such as probabilistic principal component analysis (PPCA), are ill-equipped to handle such incomplete, high-dimensional longitudinal data, as they fail to account for the nested sources of variation and temporal dependency inherent in repeated measures. We introduce hierarchical probabilistic principal component analysis (HPPCA), a two-level probabilistic factor model that explicitly separates between-subject variance from time-varying within-subject dynamics. The within-subject latent factors are modeled by a Gaussian process. We develop an EM algorithm to handle missing data and flexible covariance kernels, accelerated by computationally efficient initializers. Simulation studies demonstrated that HPPCA robustly recovers model parameter subspaces and substantially outperforms both standard PPCA and multivariate functional PCA in imputation accuracy, even under heavy missingness and model misspecification. An application to long COVID symptoms in the Researching COVID to Enhance Recovery (RECOVER) adult cohort revealed that HPPCA effectively captured the data's hierarchical structure and that its learned features significantly improved the prediction of clinical outcomes and the recovery of masked clinical records compared with existing methods.
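
The abstract reports imputation accuracy and recovery of masked records without stating here how they are scored. The sketch below shows one common protocol, under stated assumptions: hide a random subset of entries, impute, and compute the root-mean-square error on the hidden entries only. The imputer is a generic iterative truncated-SVD baseline, not HPPCA or the paper's EM algorithm, and the names `masked_rmse` and `low_rank_impute`, the RMSE metric, and the synthetic rank-2 data are all illustrative choices.

```python
import numpy as np

def masked_rmse(truth, imputed, held_out):
    """RMSE over only the entries that were hidden from the imputer."""
    err = (truth - imputed)[held_out]
    return float(np.sqrt(np.mean(err ** 2)))

def low_rank_impute(Y_obs, rank=2, n_iter=50):
    """Iterative truncated-SVD imputation (a generic baseline, not HPPCA)."""
    missing = np.isnan(Y_obs)
    # Start from column means of the observed entries.
    filled = np.where(missing, np.nanmean(Y_obs, axis=0), Y_obs)
    for _ in range(n_iter):
        mu = filled.mean(axis=0)
        U, s, Vt = np.linalg.svd(filled - mu, full_matrices=False)
        approx = (U[:, :rank] * s[:rank]) @ Vt[:rank] + mu
        # Observed entries are kept; only missing entries are updated.
        filled = np.where(missing, approx, Y_obs)
    return filled

# Synthetic rank-2 data with small noise (assumed, for illustration only).
rng = np.random.default_rng(1)
truth = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 10))
truth += 0.05 * rng.normal(size=truth.shape)

# Hide roughly 30% of entries and score two imputers on the hidden part.
held_out = rng.random(truth.shape) < 0.3
Y_obs = np.where(held_out, np.nan, truth)

mean_filled = np.where(held_out, np.nanmean(Y_obs, axis=0), Y_obs)
rmse_mean = masked_rmse(truth, mean_filled, held_out)
rmse_low_rank = masked_rmse(truth, low_rank_impute(Y_obs), held_out)
print(f"mean imputation RMSE: {rmse_mean:.3f}")
print(f"low-rank imputation RMSE: {rmse_low_rank:.3f}")
```

Scoring only on held-out entries matters because observed entries are reproduced exactly by construction, so including them would understate the error of any imputer.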