Discrete Causal Representation Learning

arXiv stat.ML / 3/27/2026


Key Points

  • The paper introduces Discrete Causal Representation Learning (DCRL), a generative framework aimed at uncovering causal relationships among discrete latent variables from noisy, entangled observations.
  • DCRL uses a directed acyclic graph over discrete latent variables plus a sparse bipartite graph connecting latents to observed variables, enabling interpretability and flexibility across mixed data types (continuous, count, and binary).
  • The authors provide identifiability results, showing that—under mild conditions—both the latent causal graph and the bipartite measurement graph can be recovered from the observed data distribution alone.
  • A three-stage pipeline is proposed: penalized estimation of the generative model, resampling of latent configurations from the fitted model, and score-based causal discovery on the resampled latents, with consistency guarantees for recovering the latent causal structure.
  • Experiments on educational assessment and synthetic image datasets indicate that DCRL can recover sparse, interpretable latent causal structures.

Abstract

Causal representation learning seeks to uncover causal relationships among high-level latent variables from low-level, entangled, and noisy observations. Existing approaches often either rely on deep neural networks, which lack interpretability and formal guarantees, or impose restrictive assumptions like linearity, continuous-only observations, and strong structural priors. These limitations particularly challenge applications with a large number of discrete latent variables and mixed-type observations. To address these challenges, we propose discrete causal representation learning (DCRL), a generative framework that models a directed acyclic graph among discrete latent variables, along with a sparse bipartite graph linking latent and observed layers. This design accommodates continuous, count, and binary responses through flexible measurement models while maintaining interpretability. Under mild conditions, we prove that both the bipartite measurement graph and the latent causal graph are identifiable from the observed data distribution alone. We further propose a three-stage estimate-resample-discovery pipeline: penalized estimation of the generative model parameters, resampling of latent configurations from the fitted model, and score-based causal discovery on the resampled latents. We establish the consistency of this procedure, ensuring reliable recovery of the latent causal structure. Empirical studies on educational assessment and synthetic image data demonstrate that DCRL recovers sparse and interpretable latent causal structures.
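To make the estimate-resample-discovery pipeline concrete, here is a minimal, hypothetical sketch on two binary latents with a known edge Z1 → Z2. Everything in it is an illustrative assumption rather than the paper's implementation: Stage 1's penalized estimation is stubbed out (the posterior over latents is crudely approximated by the noisy observed responses), and Stage 3's score-based discovery is reduced to a BIC comparison between an "independent" and an "edge present" model.

```python
import numpy as np

# Hypothetical toy sketch of the estimate-resample-discovery idea on two
# binary latents Z1 -> Z2; names and modeling choices are illustrative
# assumptions, not the authors' code.
rng = np.random.default_rng(0)
n = 2000

# Ground truth: latent DAG Z1 -> Z2, observed via noisy binary responses.
z1 = rng.binomial(1, 0.5, n)
z2 = rng.binomial(1, np.where(z1 == 1, 0.8, 0.2))
x1 = rng.binomial(1, np.where(z1 == 1, 0.9, 0.1))  # measurement of Z1
x2 = rng.binomial(1, np.where(z2 == 1, 0.9, 0.1))  # measurement of Z2

# Stage 1 (penalized estimation of the generative model) is stubbed out:
# we pretend the fitted posterior over latents is well approximated by the
# observed responses, so Stage 2's "resampled" latents are just x1, x2.
z1_hat, z2_hat = x1, x2

# Stage 3: score-based discovery. With only two binary variables this
# reduces to a BIC comparison of "independent" vs. "edge present".
def bernoulli_ll(v, p):
    p = np.clip(p, 1e-12, 1 - 1e-12)  # guard against log(0)
    return np.sum(v * np.log(p) + (1 - v) * np.log(1 - p))

def loglik_indep(a, b):
    # Factorization P(a) * P(b): 2 free parameters
    return bernoulli_ll(a, a.mean()) + bernoulli_ll(b, b.mean())

def loglik_edge(a, b):
    # Factorization P(a) * P(b | a): 3 free parameters
    ll = bernoulli_ll(a, a.mean())
    for va in (0, 1):
        mask = a == va
        ll += bernoulli_ll(b[mask], b[mask].mean())
    return ll

bic_indep = loglik_indep(z1_hat, z2_hat) - 0.5 * 2 * np.log(n)
bic_edge = loglik_edge(z1_hat, z2_hat) - 0.5 * 3 * np.log(n)
has_edge = bool(bic_edge > bic_indep)
print("edge detected:", has_edge)
```

Two caveats on this simplification: a BIC comparison between two variables can detect dependence but not orient the edge (both factorizations score identically), so the full method's DAG search over all latents is what supplies orientation; and the actual Stage 2 resamples latent configurations from the fitted posterior rather than reusing raw responses as done here.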