A Deep Generative Approach to Stratified Learning

arXiv stat.ML / 4/14/2026

Key Points

  • The paper argues that many datasets are better represented as stratified spaces (unions of manifolds of varying dimensions) than as a single manifold. It attributes the difficulty of stratified learning to varying dimensionality, intersection singularities, and a lack of efficient models for learning the underlying distributions.
  • It introduces two deep generative frameworks for learning distributions on stratified spaces: a sieve maximum likelihood method using a dimension-aware mixture of VAEs, and a diffusion-based method that leverages the score-field structure of a mixture.
  • The authors provide theoretical convergence rates for learning both ambient and intrinsic distributions, showing that performance depends on intrinsic dimensions and strata smoothness as well as ambient noise.
  • Beyond distribution learning, the work analyzes the score field geometry to establish consistency guarantees for estimating intrinsic dimensions per stratum and proposes an algorithm to infer the number of strata and their dimensions.
  • Extensive simulations and real-data experiments, including molecular dynamics, are used to demonstrate the effectiveness of the proposed approaches.
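
To make the notion of a stratified space concrete, the following minimal sketch samples noisy points from a toy union of a 2-dimensional plane and a 1-dimensional line in R^3, then estimates the local intrinsic dimension at two query points. Note that the estimator here is plain local PCA, swapped in for illustration; it is not the paper's score-field-based estimator. The function names, sample sizes, noise level, neighborhood size `k`, and eigenvalue threshold `ratio` are all assumptions chosen for this example.

```python
import numpy as np

def sample_stratified(n_plane, n_line, noise_std, rng):
    """Sample a toy stratified space in R^3 (hypothetical example):
    the plane z = 0 (dimension 2) and the z-axis (dimension 1),
    which intersect at the origin. Isotropic Gaussian ambient noise is added."""
    plane = np.column_stack([
        rng.uniform(-1, 1, n_plane),
        rng.uniform(-1, 1, n_plane),
        np.zeros(n_plane),
    ])
    line = np.column_stack([
        np.zeros(n_line),
        np.zeros(n_line),
        rng.uniform(-1, 1, n_line),
    ])
    points = np.vstack([plane, line])
    # True intrinsic dimension of the stratum each point was drawn from.
    dims = np.concatenate([np.full(n_plane, 2), np.full(n_line, 1)])
    noisy = points + noise_std * rng.standard_normal(points.shape)
    return noisy, dims

def local_pca_dimension(points, query, k=30, ratio=0.1):
    """Estimate intrinsic dimension near `query` with local PCA:
    count covariance eigenvalues of the k nearest neighbors that
    exceed `ratio` times the largest eigenvalue."""
    dists = np.linalg.norm(points - query, axis=1)
    neighbors = points[np.argsort(dists)[:k]]
    centered = neighbors - neighbors.mean(axis=0)
    # eigvalsh returns eigenvalues in ascending order; reverse to descending.
    eigvals = np.linalg.eigvalsh(centered.T @ centered / k)[::-1]
    return int(np.sum(eigvals > ratio * eigvals[0]))

rng = np.random.default_rng(0)
points, dims = sample_stratified(n_plane=2000, n_line=300, noise_std=0.005, rng=rng)

# Query points chosen well away from the intersection at the origin.
dim_on_plane = local_pca_dimension(points, np.array([0.7, 0.7, 0.0]))
dim_on_line = local_pca_dimension(points, np.array([0.0, 0.0, 0.7]))
print(dim_on_plane, dim_on_line)
```

Both query points sit far from the origin on purpose. Near the intersection, a neighborhood mixes points from both strata and a naive local estimate becomes unreliable, which is one face of the intersection singularities mentioned above.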

Abstract

While the manifold hypothesis is widely adopted in modern machine learning, complex data is often better modeled as stratified spaces: unions of manifolds (strata) of varying dimensions. Stratified learning is challenging due to varying dimensionality, intersection singularities, and the lack of efficient models for learning the underlying distributions. We provide a deep generative approach to stratified learning by developing two generative frameworks for learning distributions on stratified spaces. The first is a sieve maximum likelihood approach realized via a dimension-aware mixture of variational autoencoders. The second is a diffusion-based framework that explores the score field structure of a mixture. We establish the convergence rates for learning both the ambient and intrinsic distributions, which are shown to depend on the intrinsic dimensions and smoothness of the underlying strata. Utilizing the geometry of the score field, we also establish consistency for estimating the intrinsic dimension of each stratum and propose an algorithm that consistently estimates both the number of strata and their dimensions. Theoretical results for both frameworks provide fundamental insights into the interplay of the underlying geometry, the ambient noise level, and deep generative models. Extensive simulations and real dataset applications, such as molecular dynamics, demonstrate the effectiveness of our methods.
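
The "score field structure of a mixture" rests on a standard identity: for a mixture density p(x) = sum_k pi_k p_k(x), the score grad log p(x) equals the posterior-weighted average of the component scores, with weights w_k(x) = pi_k p_k(x) / p(x). The sketch below checks this identity numerically for a toy mixture of isotropic Gaussians. It only illustrates the identity and is not the paper's diffusion framework; the mixture weights, means, noise level `sigma`, and function names are assumptions made for the example.

```python
import numpy as np

def component_log_densities(x, means, sigma):
    """Log density of x under each isotropic Gaussian N(mean_k, sigma^2 I)."""
    d = x.shape[0]
    sq_dists = np.sum((x - means) ** 2, axis=1)
    return -0.5 * sq_dists / sigma**2 - 0.5 * d * np.log(2 * np.pi * sigma**2)

def mixture_log_density(x, weights, means, sigma):
    """Log of the mixture density, computed with a stable log-sum-exp."""
    log_terms = np.log(weights) + component_log_densities(x, means, sigma)
    m = log_terms.max()
    return m + np.log(np.exp(log_terms - m).sum())

def mixture_score(x, weights, means, sigma):
    """Score of the mixture: posterior-weighted average of component scores."""
    log_terms = np.log(weights) + component_log_densities(x, means, sigma)
    resp = np.exp(log_terms - log_terms.max())
    resp /= resp.sum()  # posterior weights w_k(x)
    component_scores = (means - x) / sigma**2  # grad log N(x; mean_k, sigma^2 I)
    return resp @ component_scores

# Hypothetical toy mixture in R^3.
weights = np.array([0.3, 0.7])
means = np.array([[0.0, 0.0, 1.0], [1.0, -1.0, 0.0]])
sigma = 0.5
x = np.array([0.4, -0.2, 0.3])

analytic = mixture_score(x, weights, means, sigma)

# Central finite differences of the log density as an independent check.
eps = 1e-6
numeric = np.array([
    (mixture_log_density(x + eps * e, weights, means, sigma)
     - mixture_log_density(x - eps * e, weights, means, sigma)) / (2 * eps)
    for e in np.eye(3)
])
max_abs_error = np.max(np.abs(analytic - numeric))
print(max_abs_error)
```

Because the posterior weights concentrate on whichever component dominates near x, the mixture score locally resembles the score of a single component, which gives some intuition for why the score field can carry information about individual strata.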