Exponential families from a single KL identity

arXiv cs.LG / 5/1/2026

📰 News · Models & Research

Key Points

  • The paper isolates a single KL-divergence identity for exponential families that expresses differences of KL divergences in terms of the log-partition function A(λ) and the moment μ_q.
  • Combining this identity with nothing more than the nonnegativity of KL divergence, the authors derive multiple classical results (e.g., a generalized three-point identity and Pythagorean theorems for I-projections).
  • The derivations also recover key structural properties of exponential families, including convexity of A(λ), its Legendre dual expressed via KL, and the Gibbs variational principle.
  • The note further shows how the same framework yields optimization formulas relevant to KL-regularized reward maximization, including the exponential tilting identity used in entropy-regularized control and RLHF.
  • Additional analytic consequences include the gradient formula for A(λ), a Bregman representation for within-family KL, and surjectivity of the moment map.
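
Concretely, writing an exponential family as p_λ(x) = h(x) exp(⟨λ, T(x)⟩ − A(λ)) with μ_q = E_q[T(X)], the central identity the bullets refer to takes the following form (reconstructed from the abstract's description; the notation is an assumption, not quoted from the paper):

```latex
\mathrm{KL}(q \,\|\, p_{\lambda_2}) - \mathrm{KL}(q \,\|\, p_{\lambda_1})
  \;=\; A(\lambda_2) - A(\lambda_1) - \langle \lambda_2 - \lambda_1,\; \mu_q \rangle
```

The entropy term E_q[log q] and the base-measure term E_q[log h] appear in both KL divergences and cancel in the difference, which is why only A(λ) and μ_q survive.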

Abstract

Exponential families encompass the distributions central to modern machine learning -- softmax, Gaussians, and Boltzmann distributions -- and underlie the theory of variational inference, entropy-regularized reinforcement learning, and RLHF. We isolate a simple identity for exponential families that expresses the KL difference \mathrm{KL}(q \| p_{\lambda_2}) - \mathrm{KL}(q \| p_{\lambda_1}) in terms of the log-partition function A(\lambda) and the moment \mu_q. Remarkably, this identity together with the single fact that \mathrm{KL} \geq 0 (with equality iff p = q) suffices, by direct substitution and rearrangement, to derive a cluster of results that are classically obtained by separate, heavier arguments: a generalized three-point identity for arbitrary reference distributions, Pythagorean theorems for I-projections and reverse I-projections, convexity of the log-partition function, identification of its Legendre dual in KL terms, the Gibbs variational principle, and the explicit optimizer in KL-regularized reward maximization, including the exponential tilting formula underlying entropy-regularized control and RLHF. Beyond these purely algebraic consequences, standard analytic arguments recover the gradient formula for the log-partition function, the Bregman representation of within-family KL divergence, and the surjectivity of the moment map. The note is self-contained.
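
As a sanity check, both headline claims — the KL-difference identity and the exponential tilting optimizer for KL-regularized reward maximization — can be verified numerically on small finite examples. The Bernoulli parameterization, the three-letter alphabet, the rewards, and the regularization strength β below are illustrative choices, not taken from the paper:

```python
import math
import random

# --- Part 1: KL-difference identity for the Bernoulli family ---
# Natural parameterization: p_lam(x) ∝ exp(lam * x), x in {0, 1},
# so the log-partition function is A(lam) = log(1 + e^lam).

def A(lam):
    return math.log(1.0 + math.exp(lam))

def mean(lam):
    # Success probability of p_lam: the sigmoid of lam.
    return math.exp(lam) / (1.0 + math.exp(lam))

def kl_bernoulli(mu_q, mu_p):
    # KL(q || p) between Bernoulli distributions with means mu_q, mu_p.
    return (mu_q * math.log(mu_q / mu_p)
            + (1 - mu_q) * math.log((1 - mu_q) / (1 - mu_p)))

mu_q, lam1, lam2 = 0.3, -0.5, 1.2
lhs = kl_bernoulli(mu_q, mean(lam2)) - kl_bernoulli(mu_q, mean(lam1))
rhs = A(lam2) - A(lam1) - (lam2 - lam1) * mu_q
assert abs(lhs - rhs) < 1e-12  # the identity holds to float precision

# --- Part 2: exponential tilting in KL-regularized reward maximization ---
# On a finite alphabet, pi* = argmax_pi E_pi[r] - beta * KL(pi || pi_ref)
# is the tilted distribution pi*(x) ∝ pi_ref(x) * exp(r(x) / beta).
ref = [0.2, 0.5, 0.3]   # reference distribution (illustrative)
r = [1.0, -0.5, 2.0]    # rewards (illustrative)
beta = 0.7

w = [p * math.exp(ri / beta) for p, ri in zip(ref, r)]
pi_star = [wi / sum(w) for wi in w]

def objective(pi):
    reward = sum(pi_i * ri for pi_i, ri in zip(pi, r))
    kl = sum(pi_i * math.log(pi_i / ref_i) for pi_i, ref_i in zip(pi, ref))
    return reward - beta * kl

# pi* should dominate random full-support perturbations of itself.
best = objective(pi_star)
random.seed(0)
for _ in range(100):
    z = [pi_i * math.exp(random.uniform(-0.1, 0.1)) for pi_i in pi_star]
    pi = [zi / sum(z) for zi in z]
    assert objective(pi) <= best + 1e-12
```

Both checks pass: the cancellation of the E_q[log q] term makes the identity exact, and the tilted distribution attains the largest regularized objective among the sampled perturbations.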