Natural gradient descent with momentum

arXiv cs.AI / 4/20/2026

💬 Opinion · Models & Research

Key Points

  • The paper studies optimization on nonlinear manifolds by viewing natural gradient descent as a form of preconditioned gradient descent from a functional (not purely parameter) perspective.
  • It explains that a natural gradient descent (NGD) step uses the Gram matrix of the tangent-space generating system, rather than the Hessian, to produce a locally optimal update in function space via a projected gradient onto the manifold's tangent space.
  • The authors note limitations of both standard gradient and natural gradient methods, including getting stuck in local minima and producing suboptimal update directions when the model class is nonlinear or the loss is poorly conditioned.
  • They propose a natural analogue of inertial optimization methods (Heavy-Ball and Nesterov) and demonstrate that this can improve the learning process for nonlinear model classes.
  • The work is positioned as a methodological advance for optimization in settings such as neural networks with differentiable activations and other differentiable parametrizations like tensor networks.
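To make the mechanism in the bullets above concrete, here is a minimal sketch of one NGD step on a toy nonlinear model. The model, the L2 metric, and all function names (`model`, `jacobian`, `ngd_step`) are illustrative assumptions, not taken from the paper; the key point is that the update is preconditioned by the Gram matrix of the tangent-space generators instead of the Hessian.

```python
import numpy as np

def model(theta, x):
    # Toy nonlinear model class: f(x; a, b) = a * tanh(b * x).
    a, b = theta
    return a * np.tanh(b * x)

def jacobian(theta, x):
    # Columns are the tangent-space generators at theta:
    # partial derivatives of f(x; theta) w.r.t. each parameter.
    a, b = theta
    t = np.tanh(b * x)
    return np.stack([t, a * (1 - t**2) * x], axis=1)  # shape (n, 2)

def ngd_step(theta, x, y, lr=0.5, damping=1e-8):
    J = jacobian(theta, x)
    r = model(theta, x) - y            # residual in function space
    G = J.T @ J / len(x)               # Gram matrix (empirical L2 metric)
    g = J.T @ r / len(x)               # Euclidean gradient of 0.5 * ||f - y||^2
    # Precondition with the (damped) inverse Gram matrix, not the Hessian:
    return theta - lr * np.linalg.solve(G + damping * np.eye(len(theta)), g)

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=200)
y = 1.5 * np.tanh(0.8 * x)             # target lies on the model manifold
theta = np.array([0.5, 0.3])
for _ in range(100):
    theta = ngd_step(theta, x, y)
```

Solving against `G` rather than the Hessian is what makes this a projected-gradient update in function space: the step is the best tangent-space approximation of the functional gradient, pulled back to parameters.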

Abstract

We consider the problem of approximating a function by an element of a nonlinear manifold which admits a differentiable parametrization, typical examples being neural networks with differentiable activation functions or tensor networks. Natural gradient descent (NGD) for the optimization of a loss function can be seen as a preconditioned gradient descent where updates in the parameter space are driven by a functional perspective. In a spirit similar to Newton's method, a NGD step uses, instead of the Hessian, the Gram matrix of the generating system of the tangent space to the approximation manifold at the current iterate, with respect to a suitable metric. This corresponds to a locally optimal update in function space, following a projected gradient onto the tangent space to the manifold. Still, both gradient and natural gradient descent methods get stuck in local minima. Furthermore, when the model class is a nonlinear manifold or the loss function is not ideally conditioned (e.g., the KL-divergence for density estimation, or a norm of the residual of a partial differential equation in physics-informed learning), even the natural gradient might yield non-optimal directions at each step. This work introduces a natural version of classical inertial dynamic methods like Heavy-Ball or Nesterov and shows how it can improve the learning process when working with nonlinear model classes.