Reconsidering Dependency Networks from an Information Geometry Perspective

arXiv cs.LG / 4/2/2026

Key Points

  • The paper addresses a gap in the theoretical foundations of dependency networks by analyzing pseudo-Gibbs sampling using information geometry.
  • It interprets each pseudo-Gibbs sampling step as an m-projection onto a full conditional manifold and introduces the full conditional divergence to study how the stationary distribution is positioned (a minimal sampler sketch follows this list).
  • The authors derive an upper bound characterizing the stationary distribution’s location in probability space and reformulate both structure and parameter learning as optimization problems.
  • Structure and parameter learning are shown to decompose into independent per-node subproblems, making the learning formulation more tractable (a per-node fitting sketch follows the Abstract).
  • The paper proves that the learned model distribution converges to the true underlying distribution as training data size goes to infinity, with experiments indicating the bound is tight in practice.
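
To make the sampling procedure concrete, here is a minimal Python sketch of pseudo-Gibbs sampling, assuming binary variables and hand-specified local conditionals. The names (`pseudo_gibbs_sample`, `local_conditionals`) are illustrative, not the paper's code; only the mechanism of resampling each node from its own independently learned conditional comes from the source.

```python
import numpy as np

def pseudo_gibbs_sample(local_conditionals, x0, n_sweeps, rng=None):
    """Run pseudo-Gibbs sampling over a vector of binary variables.

    local_conditionals[i](x) should return P(x_i = 1 | x_{-i}) under the
    independently learned local model for node i.
    """
    if rng is None:
        rng = np.random.default_rng()
    x = np.asarray(x0).copy()
    for _ in range(n_sweeps):
        # One sweep resamples every node from its own local conditional.
        # Because the local models are learned independently, they need not
        # be consistent with any single joint distribution; the stationary
        # distribution of this chain defines the dependency network's model
        # distribution, which generally has no closed form.
        for i, cond in enumerate(local_conditionals):
            x[i] = int(rng.random() < cond(x))
    return x

# Toy usage with hand-specified local conditionals for two binary nodes.
conds = [
    lambda x: 0.9 if x[1] == 1 else 0.2,  # P(x_0 = 1 | x_1)
    lambda x: 0.7 if x[0] == 1 else 0.3,  # P(x_1 = 1 | x_0)
]
sample = pseudo_gibbs_sample(conds, np.array([0, 0]), n_sweeps=100)
```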

Abstract

Dependency networks (Heckerman et al., 2000) provide a flexible framework for modeling complex systems with many variables by combining independently learned local conditional distributions through pseudo-Gibbs sampling. Despite their computational advantages over Bayesian and Markov networks, the theoretical foundations of dependency networks remain incomplete, primarily because their model distributions -- defined as stationary distributions of pseudo-Gibbs sampling -- lack closed-form expressions. This paper develops an information-geometric analysis of pseudo-Gibbs sampling, interpreting each sampling step as an m-projection onto a full conditional manifold. Building on this interpretation, we introduce the full conditional divergence and derive an upper bound that characterizes the location of the stationary distribution in the space of probability distributions. We then reformulate both structure and parameter learning as optimization problems that decompose into independent subproblems for each node, and prove that the learned model distribution converges to the true underlying distribution as the number of training samples grows to infinity. Experiments confirm that the proposed upper bound is tight in practice.
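
For intuition on the per-node decomposition, the following is a hedged sketch of dependency-network learning in the style of Heckerman et al. (2000): each node's conditional P(x_i | x_{-i}) is fit independently of all the others. Logistic regression is an illustrative stand-in for whatever local model family the paper's formulation would select, and `fit_local_conditionals` is a hypothetical helper name.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_local_conditionals(X):
    """Fit one local conditional model per node, independently.

    X: (n_samples, n_nodes) binary data matrix; assumes both classes
    appear in every column. Returns a list of callables compatible with
    pseudo_gibbs_sample above, where models[i](x) approximates
    P(x_i = 1 | x_{-i}).
    """
    n_nodes = X.shape[1]
    models = []
    for i in range(n_nodes):
        features = np.delete(X, i, axis=1)  # x_{-i}
        target = X[:, i]                    # x_i
        clf = LogisticRegression().fit(features, target)
        # Bind i and clf by value so each closure keeps its own node's model.
        models.append(
            lambda x, i=i, clf=clf:
                clf.predict_proba(np.delete(np.asarray(x), i)[None, :])[0, 1]
        )
    return models
```

Because each fit touches only node i's own target, the subproblems are independent and can run in parallel, which is the tractability the key points refer to; under this reading, structure learning amounts to selecting which components of x_{-i} each local model actually depends on.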