AI Navigate

On Interpolation Formulas Describing Neural Network Generalization

arXiv cs.LG · March 17, 2026


Key Points

  • The work extends Domingos' interpolation formula to stochastic training by introducing a stochastic gradient kernel via a continuous-time diffusion approximation (see the path-kernel sketch after this list).
  • It proves stochastic versions of Domingos' theorems and shows the expected network output has a kernel-machine representation with optimizer-specific weighting, reflecting loss-dependent contributions and gradient alignment along training trajectories.
  • It links generalization error to the null space of the integral operator induced by the stochastic gradient kernel, and offers a unified interpretation of diffusion models and GANs: stage-wise, noise-localized corrections for diffusion versus distribution-guided corrections shaped by discriminator geometry for GANs.
  • It presents numerical experiments illustrating the evolution of implicit kernels during optimization, supporting a feature-space memory view where test predictions arise from kernel-weighted retrieval of stored tangent features.
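
For readers who have not seen the deterministic starting point, here is a minimal sketch of how the path-kernel form arises under gradient flow, following the standard argument behind Domingos' 2020 result (the notation is ours and may differ from the paper's):

```latex
% Gradient flow on L = \sum_i L(y_i^*, y_i) with learning rate \varepsilon:
%   \frac{dw}{dt} = -\varepsilon \, \nabla_w L(w_t)
% Chain rule applied to a test output y(x; w_t):
\frac{d\,y(x)}{dt}
  = \nabla_w y(x) \cdot \frac{dw}{dt}
  = -\varepsilon \sum_i \frac{\partial L}{\partial y_i}\,
    \underbrace{\nabla_w y(x) \cdot \nabla_w y(x_i)}_{K_t(x,\,x_i)\ \text{(tangent kernel at time } t)}

% Integrating along the training trajectory from 0 to T and grouping terms:
y_T(x) \;=\; y_0(x) \;-\; \varepsilon \int_0^T \sum_i \frac{\partial L}{\partial y_i}(t)\, K_t(x, x_i)\, dt
\;\;\approx\;\; \sum_i a_i\, K^{p}(x, x_i) \;+\; b
```

Here K^p is the time-integrated (path) kernel, the weights a_i collect the loss-derivative factors along the trajectory, and b = y_0(x) is the model's initial output. The paper's contribution, as summarized above, is to replace this deterministic trajectory with a continuous-time diffusion approximation of SGD, yielding a stochastic gradient kernel and an optimizer-specific weighting of the expected output.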

Abstract

In 2020, Domingos introduced an interpolation formula valid for "every model trained by gradient descent" and concluded that such models behave approximately as kernel machines. In this work, we extend Domingos' formula to stochastic training. We introduce a stochastic gradient kernel that extends the deterministic version via a continuous-time diffusion approximation. We prove stochastic versions of Domingos' theorems and show that the expected network output admits a kernel-machine representation with optimizer-specific weighting. This representation reveals that training samples contribute through loss-dependent weights and gradient alignment along the training trajectory. We then link the generalization error to the null space of the integral operator induced by the stochastic gradient kernel. The same path-kernel viewpoint provides a unified interpretation of diffusion models and GANs: diffusion induces stage-wise, noise-localized corrections, whereas GANs induce distribution-guided corrections shaped by discriminator geometry. We visualize the evolution of implicit kernels during optimization and quantify out-of-distribution behavior through a series of numerical experiments. Our results support a feature-space memory view of learning: training stores data-dependent information in an evolving tangent feature geometry, and predictions at test time arise from kernel-weighted retrieval and aggregation of these stored features, with generalization governed by alignment between test points and the learned feature memory.
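
To make the "evolving implicit kernel" and "kernel-weighted retrieval" picture concrete, the following is a small self-contained PyTorch sketch written for illustration; it is not the paper's code or experiments. It trains a toy regressor with plain full-batch gradient descent (the deterministic case the paper generalizes), records the tangent kernel between a test point and every training point at each step, and accumulates the discrete analogue of the path-kernel reconstruction of the test prediction. The model, the toy 1-D task, and helper names such as `flat_grad` are illustrative choices.

```python
# Minimal illustration (not the paper's code): track the tangent kernel
# K_t(x_test, x_i) = <grad_w f(x_test), grad_w f(x_i)> along a training
# trajectory and accumulate a discrete path-kernel reconstruction.
import torch

torch.manual_seed(0)

# Toy 1-D regression data and a single held-out test point.
X_train = torch.linspace(-2, 2, 16).unsqueeze(1)
y_train = torch.sin(X_train)
x_test = torch.tensor([[0.5]])

model = torch.nn.Sequential(
    torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1)
)

def flat_grad(scalar_output):
    """Flattened gradient of a scalar model output w.r.t. all parameters."""
    grads = torch.autograd.grad(scalar_output, list(model.parameters()),
                                retain_graph=True)
    return torch.cat([g.reshape(-1) for g in grads])

lr, steps, n = 0.05, 200, len(X_train)
path_kernel = torch.zeros(n)   # accumulated K^p(x_test, x_i) over the trajectory
weights = torch.zeros(n)       # accumulated loss-derivative * kernel contributions

with torch.no_grad():
    f0_test = model(x_test).item()   # b = output at the test point before training

for t in range(steps):
    preds = model(X_train)
    residuals = (preds - y_train).detach().squeeze()   # dL/dy_i for 0.5*(y - y*)^2

    # Tangent-kernel row at the current parameters: K_t(x_test, x_i) for all i.
    g_test = flat_grad(model(x_test).squeeze())
    g_train = torch.stack([flat_grad(preds[i].squeeze()) for i in range(n)])
    k_row = g_train @ g_test

    # Discrete (Euler) accumulation of the path kernel and its loss weighting.
    path_kernel += lr * k_row
    weights += lr * residuals * k_row

    # Plain full-batch gradient-descent step on the mean squared error.
    loss = 0.5 * ((preds - y_train) ** 2).mean()
    model.zero_grad()
    loss.backward()
    with torch.no_grad():
        for p in model.parameters():
            p -= lr * p.grad

# Kernel-machine reconstruction of the test output vs. the trained network.
kernel_pred = f0_test - weights.sum().item() / n
print("network output:", model(x_test).item())
print("path-kernel reconstruction:", kernel_pred)

# "Kernel-weighted retrieval": training points most aligned with the test point.
top = torch.topk(path_kernel, k=3).indices
print("most-retrieved training indices:", top.tolist())
```

The final prints compare the trained network's prediction with its path-kernel reconstruction and list the training points whose accumulated kernel values are largest, i.e., the points the test prediction "retrieves" most strongly. The paper's stochastic setting would replace the deterministic step with a diffusion approximation of SGD and take expectations over trajectories, giving the optimizer-specific weighting described in the abstract.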