Gradient Regularized Newton Boosting Trees with Global Convergence

arXiv stat.ML / 5/4/2026


Key Points

  • The paper tackles a gap in understanding the global convergence behavior of Newton boosting for Gradient Boosting Decision Trees (GBDTs), which are widely used in tools like XGBoost, LightGBM, and CatBoost.
  • It introduces “Restricted Newton Descent,” an optimization framework for Newton’s method on Hilbert spaces with inexact iterates, using concepts such as cosine angle and weak gradient edge.
  • For smooth, strongly convex losses with a Hessian-dominance condition, the authors prove that vanilla Newton boosting converges linearly.
  • For more general convex losses with Lipschitz Hessians, they propose a gradient-regularized Newton scheme for the restricted weak learner setting that adds an adaptive L2 regularization term proportional to the square root of the gradient norm at each iteration (see the sketch after this list).
  • The resulting second-order GBDT algorithm achieves a global convergence rate of O(1/k^2), matching first-order boosting with Nesterov momentum, and experiments show it converges in cases where vanilla Newton boosting may diverge.
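
To make the regularized step concrete, below is a minimal sketch of Newton boosting with a gradient-adaptive L2 term. It assumes squared loss and scikit-learn regression trees as the weak learners; the function name, the `reg_scale` constant, and the loss choice are illustrative assumptions, with only the lambda_k ∝ sqrt(||g_k||) scaling taken from the paper's description.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_regularized_newton_boost(X, y, n_rounds=50, learning_rate=0.1,
                                      reg_scale=1.0, max_depth=3):
    """Sketch of Newton boosting with an adaptive l2 term lambda_k ~ sqrt(||g_k||).

    Squared loss is used for illustration, so the per-example gradient is
    g = F - y and the Hessian is h = 1.
    """
    y = np.asarray(y, dtype=float)
    F = np.zeros(len(y), dtype=float)   # current ensemble prediction
    ensemble = []                        # list of (tree, leaf -> weight) pairs

    for _ in range(n_rounds):
        g = F - y                        # first derivatives of 1/2 (F - y)^2
        h = np.ones_like(g)              # second derivatives (constant here)

        # Adaptive regularization proportional to the square root of the gradient norm.
        lam = reg_scale * np.sqrt(np.linalg.norm(g))

        # Grow a tree on the plain Newton targets to fix the partition of the data.
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, -g / h)

        # Recompute each leaf weight with the regularized Newton formula
        #   w_j = - sum_{i in leaf j} g_i / (sum_{i in leaf j} h_i + lambda_k).
        leaf_ids = tree.apply(X)
        weights = {}
        for leaf in np.unique(leaf_ids):
            mask = leaf_ids == leaf
            weights[leaf] = -g[mask].sum() / (h[mask].sum() + lam)

        F += learning_rate * np.array([weights[l] for l in leaf_ids])
        ensemble.append((tree, weights))

    return ensemble
```

In a full implementation the tree structure itself would be grown against the regularized second-order objective; fixing the partition with plain Newton targets and only re-solving the leaf weights is a simplification to keep the sketch short.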

Abstract

Gradient Boosting Decision Trees (GBDTs) dominate tabular machine learning, with modern implementations such as XGBoost, LightGBM, and CatBoost built on Newton boosting: a second-order descent step in the space of decision trees. Despite its empirical success, the global convergence of Newton boosting is poorly understood compared to first-order boosting. In this paper, we introduce Restricted Newton Descent, a framework for studying convex optimization with Newton's method on Hilbert spaces with inexact iterates, based on the concepts of cosine angle and weak gradient edge. Within this framework, we recover Newton boosting with GBDTs and classical finite-dimensional theory as special cases. We first prove that vanilla Newton boosting achieves a linear rate of convergence for smooth, strongly convex losses that satisfy a Hessian-dominance condition. To handle general convex losses with Lipschitz Hessians, we extend a recent gradient-regularized Newton scheme to the restricted weak learner setting. This scheme minimally modifies the classical algorithm by introducing an adaptive ℓ2-regularization term proportional to the square root of the gradient norm at each iteration. We establish an O(1/k^2) rate for this scheme, thereby obtaining a globally convergent second-order GBDT algorithm with a rate matching that of first-order boosting with Nesterov momentum. In numerical experiments, we show that our scheme converges while vanilla Newton boosting may diverge.
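
Written out in the function-space notation suggested by the abstract, one iteration of the gradient-regularized restricted Newton scheme could take the following form. This is a sketch: the constant c > 0 and the exact form of the restricted minimization over the tree class are assumptions, with only the square-root-of-gradient-norm scaling taken from the abstract.

```latex
% Sketch of one gradient-regularized restricted Newton step for a loss L and
% ensemble F_k; c > 0 and the restricted minimization form are illustrative.
\lambda_k = c\,\sqrt{\lVert \nabla L(F_k) \rVert}, \qquad
f_{k+1} \in \operatorname*{arg\,min}_{f \in \mathcal{F}_{\mathrm{trees}}}
  \big\langle \nabla L(F_k), f \big\rangle
  + \tfrac{1}{2} \big\langle \nabla^2 L(F_k)\, f, f \big\rangle
  + \tfrac{\lambda_k}{2} \lVert f \rVert^2, \qquad
F_{k+1} = F_k + f_{k+1}.
```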