Nested Atoms Model with Application to Clustering Big Population-Scale Single-Cell Data

arXiv stat.ML / 4/14/2026

Key Points

  • The paper targets clustering of nested/hierarchical data where both group-level covariates and observation-level covariates must be modeled jointly.
  • Using the OneK1K scRNA-seq dataset (982 individuals, 1.27M cells) as motivation, the authors aim to cluster both cells and individuals while incorporating individual-specific genotype information.
  • They introduce the Nested Atoms Model (NAM), a Bayesian nonparametric framework designed to perform two-layer clustering that accounts for heterogeneity at the individual (group) and cell (observation) levels.
  • To make NAM practical for high-dimensional genomics data, the authors develop a fast variational Bayesian inference algorithm.
  • Simulations indicate that NAM outperforms existing methods that ignore group-level variables. Applied to OneK1K, NAM identifies clusters of genetically similar individuals with homogeneous cell-type profiles, and the resulting cell clusters align with known immune cell types.

Abstract

We consider the problem of clustering nested or hierarchical data, where observations are grouped and there are both group-level and observation-level variables. In our motivating OneK1K dataset, observations consist of single-cell RNA-sequencing (scRNA-seq) data from 982 individuals (groups), totaling 1.27 million cells (observations), along with individual-specific genotype data. This type of data would enable the identification of cell types and the investigation of how genetic variations among individuals influence differences in cell-type profiles. Our goal, therefore, is to jointly cluster cells and individuals to capture the heterogeneity across both levels using cell-specific gene expressions as well as individual-specific genotypes. However, existing grouped clustering methods do not incorporate group-level variables, thereby limiting their ability to capture the heterogeneity of genotypes in our motivating application. To address this, we propose the Nested Atoms Model (NAM), a new Bayesian nonparametric approach that enables the desired two-layered clustering, accounting for both group-level and observation-level variables. To scale NAM for high-dimensional data, we develop a fast variational Bayesian inference algorithm. Simulations show that NAM outperforms existing methods that ignore group-level variables. Applied to the OneK1K dataset, NAM identifies clusters of genetically similar individuals with homogeneous cell-type profiles. The resulting cell clusters align with known immune cell types based on differential gene expression, underscoring the ability of NAM to capture nested heterogeneity and provide biologically meaningful insights.
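To make the nested data layout concrete, the following is a minimal sketch of a naive two-stage baseline, not NAM itself. It first clusters cells by expression, then clusters individuals using their cell-cluster proportions concatenated with genotype features. Unlike the joint Bayesian nonparametric model described in the abstract, it treats the two layers sequentially and requires the numbers of clusters to be fixed in advance. All names (`kmeans`, `two_stage_cluster`, `expr`, `group_ids`, `genotypes`) and the toy data are assumptions made for illustration only.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns integer labels of shape (n,)."""
    rng = np.random.default_rng(seed)
    # Initialise centers from k distinct data points.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Squared distance from every point to every center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return labels

def two_stage_cluster(expr, group_ids, genotypes,
                      n_cell_clusters, n_group_clusters, seed=0):
    """Hypothetical sequential baseline (NOT the paper's joint model)."""
    # Stage 1: cluster all cells on observation-level variables.
    cell_labels = kmeans(expr, n_cell_clusters, seed=seed)
    # Stage 2: summarise each individual by its cell-cluster proportions.
    n_groups = genotypes.shape[0]
    counts = np.zeros((n_groups, n_cell_clusters))
    np.add.at(counts, (group_ids, cell_labels), 1.0)
    props = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1.0)
    # Append group-level variables (genotypes) and cluster individuals.
    features = np.hstack([props, genotypes])
    group_labels = kmeans(features, n_group_clusters, seed=seed)
    return cell_labels, group_labels

# Toy nested data (assumed sizes): 6 individuals, 40 cells each.
rng = np.random.default_rng(1)
n_groups, cells_per_group, n_genes, n_snps = 6, 40, 5, 3
group_ids = np.repeat(np.arange(n_groups), cells_per_group)
expr = rng.normal(size=(n_groups * cells_per_group, n_genes))
genotypes = rng.integers(0, 3, size=(n_groups, n_snps)).astype(float)

cell_labels, group_labels = two_stage_cluster(
    expr, group_ids, genotypes, n_cell_clusters=3, n_group_clusters=2
)
print(cell_labels.shape, group_labels.shape)
```

A sequential scheme like this cannot let the individual-level grouping inform the cell-level clustering, which is the kind of limitation a joint two-layer model is intended to address.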