Two-sample comparison through additive tree models for density ratios

arXiv stat.ML / 4/23/2026


Key Points

  • The paper tackles two-sample comparison by estimating the density ratio between two distributions from i.i.d. samples, using additive tree models.
  • It introduces a new training objective called “balancing loss,” which enables tree models to be optimized with supervised-learning algorithms such as forward-stagewise optimization and gradient boosting.
  • The balancing loss is shown to relate to exponential-family kernels and can be used as a pseudo-likelihood, enabling generalized Bayesian inference via backfitting samplers for Bayesian additive regression trees (BART).
  • The authors provide uncertainty quantification for the estimated density ratio, and they link the balancing loss to binary classification losses and to variational forms of f-divergences (notably squared Hellinger distance).
  • Experiments indicate improved accuracy and computational efficiency, and the method is demonstrated on evaluating generative models for microbiome compositional data.
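
The paper's balancing loss itself is not reproduced in this summary, but the connection it draws to binary classification losses rests on a standard identity: a probabilistic classifier trained to distinguish the two samples recovers the density ratio through its posterior odds. The sketch below (hypothetical names, not the authors' code) verifies this identity in closed form for two unit-variance Gaussians, where the Bayes-optimal posterior is available analytically:

```python
import math

# Classifier trick for density ratio estimation.
# Samples come from P = N(0, 1) and Q = N(1, 1) with equal class priors.
# If pi(x) = P(class = P | x), then for balanced classes
#     r(x) = p(x) / q(x) = pi(x) / (1 - pi(x)).
# For these two Gaussians the analytic ratio is exp(0.5 - x), so we can
# check the identity exactly using the Bayes-optimal posterior.

def normal_pdf(x: float, mu: float) -> float:
    """Density of N(mu, 1) at x."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)

def bayes_posterior(x: float) -> float:
    """Bayes-optimal P(class = P | x) under equal priors."""
    p, q = normal_pdf(x, 0.0), normal_pdf(x, 1.0)
    return p / (p + q)

def ratio_via_classifier(x: float) -> float:
    """Density ratio recovered from the classifier's posterior odds."""
    pi = bayes_posterior(x)
    return pi / (1.0 - pi)

for x in (-1.0, 0.0, 2.0):
    analytic = math.exp(0.5 - x)
    assert abs(ratio_via_classifier(x) - analytic) < 1e-9
```

In practice the posterior is unknown and is fit by a boosted classifier on the pooled, labeled samples; the paper's contribution is a loss tailored to the ratio itself, with uncertainty quantification that a plain classifier does not provide.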

Abstract

The ratio of two densities provides a direct characterization of their differences. We consider the two-sample comparison problem by estimating this ratio given i.i.d. observations from two distributions. To this end, we propose additive tree models for density ratio estimation along with efficient algorithms using a new loss function, the balancing loss. The loss allows tree-based models to be trained using several algorithms originally designed for supervised learning, such as forward-stagewise optimization and gradient boosting. Moreover, the balancing loss resembles an exponential family kernel, and it can serve as a pseudo-likelihood with conjugate priors. This property enables generalized Bayesian inference on the density ratio using backfitting samplers designed for Bayesian additive regression trees (BART). Our Bayesian strategy provides uncertainty quantification for the inferred density ratio, which is critical for applications involving high-dimensional and data-limited distributions with potentially substantial uncertainty. We further show connections of the balancing loss to the exponential loss in binary classification and to the variational form of f-divergence, particularly the squared Hellinger distance. Numerical experiments demonstrate that our method achieves both accuracy and computational efficiency, while uniquely providing uncertainty quantification. Finally, we demonstrate its application to assessing the quality of generative models for microbiome compositional data.
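
The "variational form of f-divergence" mentioned above refers to the standard Fenchel-dual representation; the abstract does not state the balancing loss explicitly, so the following is only the generic background it builds on. For densities $p$ and $q$ and a convex $f$ with $f(1)=0$,

$$
D_f(P \,\|\, Q) \;=\; \sup_{T} \; \mathbb{E}_{P}[\,T(X)\,] \;-\; \mathbb{E}_{Q}[\,f^*(T(X))\,],
\qquad f^*(t) = \sup_{u} \,\{\,ut - f(u)\,\},
$$

with the supremum attained at $T^*(x) = f'\!\big(p(x)/q(x)\big)$, so optimizing the variational objective implicitly estimates the density ratio. For the squared Hellinger distance one takes $f(u) = (\sqrt{u} - 1)^2$, whose convex conjugate is $f^*(t) = t/(1-t)$ for $t < 1$.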