Escape dynamics and implicit bias of one-pass SGD in overparameterized quadratic networks

arXiv stat.ML / 4/6/2026


Key Points

  • The paper analyzes one-pass stochastic gradient descent (SGD) in a teacher–student two-layer neural network with quadratic activations, deriving low-dimensional ODEs that track overlap dynamics in the high-dimensional limit.
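
As a purely illustrative companion to this key point, the sketch below runs one-pass (online) SGD on a small teacher–student pair with quadratic activations and records the student–teacher overlap matrix Q = W W*ᵀ / N along the trajectory. The network parameterization f(x) = Σ_k (w_k · x)²/N, the step-size scaling, and all sizes are assumptions made for this demo, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes; the paper works in the limit N -> infinity at fixed widths p, p*.
N, p, p_star = 400, 3, 2        # input dimension, student width, teacher width
eta, steps = 0.2, 50_000        # learning rate and number of one-pass SGD steps (assumed values)

W_star = rng.standard_normal((p_star, N))   # fixed teacher weights (rows = hidden units)
W = rng.standard_normal((p, N))             # random student initialization

def net(weights, x):
    """Two-layer network with quadratic activations: sum_k (w_k . x / sqrt(N))^2."""
    lam = weights @ x / np.sqrt(N)
    return np.sum(lam ** 2)

overlap_history = []
for t in range(steps):
    x = rng.standard_normal(N)                        # fresh Gaussian sample each step: one-pass SGD
    lam = W @ x / np.sqrt(N)                          # student pre-activations
    err = np.sum(lam ** 2) - net(W_star, x)           # f(x) - f*(x)
    grad = 2.0 * err * np.outer(lam, x) / np.sqrt(N)  # gradient of the per-sample loss 0.5 * err**2 w.r.t. W
    W -= (eta / np.sqrt(N)) * grad                    # step-size scaling chosen here for illustration only

    if t % 1000 == 0:
        overlap_history.append(W @ W_star.T / N)      # student-teacher overlap matrix Q (p x p*)

print("final student-teacher overlaps:\n", overlap_history[-1])
```

Plotting the entries of Q against t/N gives the kind of low-dimensional trajectory that the paper's ODEs are meant to describe in the N → ∞ limit.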

Abstract

We analyze the one-pass stochastic gradient descent dynamics of a two-layer neural network with quadratic activations in a teacher–student framework. In the high-dimensional regime, where the input dimension N and the number of samples M diverge at fixed ratio α = M/N, and for finite hidden widths p and p* of the student and teacher, respectively, we study the low-dimensional ordinary differential equations that govern the evolution of the student–teacher and student–student overlap matrices. We show that overparameterization (p > p*) only modestly accelerates escape from a plateau of poor generalization, acting through the prefactor of the exponential decay of the loss. We then examine how unconstrained weight norms introduce a continuous rotational symmetry that produces a nontrivial manifold of zero-loss solutions for p > 1. From this manifold the dynamics consistently selects the solution closest to the random initialization, as enforced by a conserved quantity in the ODEs governing the evolution of the overlaps. Finally, a Hessian analysis of the population-loss landscape confirms that the plateau corresponds to a saddle with at least one negative eigenvalue and that the solution manifold consists of marginal minima.
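
To make the symmetry statement concrete, here is a minimal numerical check. It assumes the quadratic-activation network can be written as f(x) = Σ_k (w_k · x)²/N = xᵀ(WᵀW/N)x (an illustrative parameterization, not necessarily the paper's exact conventions): rotating the hidden units, W → OW with O orthogonal, leaves WᵀW, and therefore the network function and the population loss, unchanged. This is the continuous symmetry behind the manifold of zero-loss solutions for p > 1.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 200, 4                                     # illustrative sizes

W = rng.standard_normal((p, N))                   # student weights, rows = hidden units
O, _ = np.linalg.qr(rng.standard_normal((p, p)))  # random p x p orthogonal matrix
W_rot = O @ W                                     # rotate the hidden units among themselves

def net(weights, x):
    """Quadratic-activation network: sum_k (w_k . x)^2 / N = x^T (W^T W / N) x."""
    lam = weights @ x
    return np.sum(lam ** 2) / N

x = rng.standard_normal(N)
print(net(W, x), net(W_rot, x))                   # identical up to floating-point error
print(np.allclose(W.T @ W, W_rot.T @ W_rot))      # True: W^T W is invariant under W -> O W
```

Any point on the zero-loss manifold can be rotated into another zero-loss point in this way, which is why the manifold is nontrivial for p > 1 and why a selection principle, here the conserved quantity picking the solution closest to initialization, is needed to say where the dynamics ends up.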