Gradient Descent with Projection Finds Over-Parameterized Neural Networks for Learning Low-Degree Polynomials with Nearly Minimax Optimal Rate

arXiv stat.ML / 3/24/2026


Key Points

  • The paper studies learning low-degree spherical polynomials on the unit sphere using an over-parameterized two-layer neural network with an augmented feature representation.
  • It introduces a new training method, Gradient Descent with Projection (GDP), and proves improved sample complexity: for target regression risk ε, the required number of samples scales roughly as n ≍ log(4/δ)·d^{k0}/ε with high probability.
  • The authors show this rate is nearly unimprovable by relating the network’s achieved regression risk to a nonparametric rate of order log(4/δ)·d^{k0}/n.
  • They compare against minimax optimal performance for regression with a kernel of rank Θ(d^{k0}), concluding the GDP-trained network attains a nearly minimax optimal rate.
  • For the practical setting where the true polynomial degree k0 is unknown, the paper provides a provable adaptive degree-selection algorithm that recovers k0 while preserving the nearly optimal regression rate. The paper also claims to be the first to obtain such nearly optimal bounds with the ReLU activation and algorithmic guarantees, going beyond the NTK regime.
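The core mechanism named in the title, gradient descent with projection, can be illustrated with a minimal sketch. This is not the paper's exact GDP algorithm: the augmented features are omitted, and the projection set (an L2 ball of radius R around the initialization), architecture, and hyperparameters below are illustrative assumptions. The sketch only shows the generic pattern of alternating a gradient step with a projection back onto a constraint set.

```python
import numpy as np

# Minimal sketch of projected gradient descent for a two-layer ReLU network.
# NOTE: this is NOT the paper's exact GDP method; the projection set (an L2
# ball of radius R around the initialization) and all hyperparameters are
# illustrative assumptions.

rng = np.random.default_rng(0)
d, m, n = 5, 64, 200                              # input dim, hidden width, samples

# Inputs on the unit sphere; target is a simple degree-2 spherical polynomial.
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = X[:, 0] * X[:, 1]

W = rng.normal(size=(m, d)) / np.sqrt(d)          # trained first layer
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)  # fixed second layer
W0, R, lr = W.copy(), 1.0, 0.2

def predict(W):
    return np.maximum(X @ W.T, 0.0) @ a           # two-layer ReLU network output

mse_init = np.mean((predict(W0) - y) ** 2)

for _ in range(1000):
    H = X @ W.T                                   # pre-activations, shape (n, m)
    r = predict(W) - y                            # residuals, shape (n,)
    G = ((H > 0) * np.outer(r, a)).T @ X / n      # gradient of (1/2)*MSE w.r.t. W
    W -= lr * G                                   # gradient step
    D = W - W0                                    # projection step: clip the
    nrm = np.linalg.norm(D)                       # deviation from init to radius R
    if nrm > R:
        W = W0 + D * (R / nrm)

mse = np.mean((predict(W) - y) ** 2)
```

The projection after each step keeps the weights inside the constraint set throughout training, which is what makes norm-based generalization arguments tractable; the specific constraint set used by the paper is not reproduced here.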

Abstract

In this paper, we study the problem of learning a low-degree spherical polynomial of degree k_0 = \Theta(1) \ge 1 defined on the unit sphere in \mathbb{R}^d by training an over-parameterized two-layer neural network with augmented features. Our main result is a significantly improved sample complexity for learning such low-degree polynomials. We show that, for any regression risk \epsilon \in (0, \Theta(d^{-k_0})], an over-parameterized two-layer neural network trained by a novel Gradient Descent with Projection (GDP) algorithm requires a sample complexity of n \asymp \log(4/\delta) \cdot d^{k_0}/\epsilon with probability 1-\delta for \delta \in (0,1), in contrast with the representative sample complexity \Theta(d^{k_0} \max\{\epsilon^{-2}, \log d\}). Moreover, this sample complexity is nearly unimprovable, since the trained network attains a nearly optimal nonparametric regression risk of order \Theta(\log(4/\delta) \cdot d^{k_0}/n) with probability at least 1-\delta. On the other hand, the minimax optimal rate for the regression risk with a kernel of rank \Theta(d^{k_0}) is \Theta(d^{k_0}/n), so the rate of the nonparametric regression risk of the network trained by GDP is nearly minimax optimal. In the case that the ground-truth degree k_0 is unknown, we present a novel and provable adaptive degree selection algorithm which identifies the true degree and achieves the same nearly optimal regression rate. To the best of our knowledge, this is the first time that a nearly optimal risk bound has been obtained by training an over-parameterized neural network with a popular activation function (ReLU) and an algorithmic guarantee for learning low-degree spherical polynomials. Due to the feature learning capability of GDP, our results go beyond the regular Neural Tangent Kernel (NTK) regime.
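The adaptive degree selection idea can be illustrated with a simple, hypothetical holdout-based rule: fit estimators of increasing degree and pick the smallest degree whose held-out risk is close to the best observed risk. This is only a high-level stand-in; the paper's actual selection rule and its estimator (a GDP-trained network rather than least squares on monomial features) are not reproduced here, and the 10% tolerance below is an arbitrary choice for illustration.

```python
import numpy as np
from itertools import combinations_with_replacement

# Hypothetical sketch of adaptive degree selection via a holdout risk curve.
# The paper's actual algorithm differs; plain least squares on monomial
# features stands in for the trained network purely for illustration.

rng = np.random.default_rng(1)
d, n, k0 = 4, 2000, 2
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)           # inputs on the unit sphere
y = 3.0 * X[:, 0] * X[:, 1] + 0.1 * rng.normal(size=n)  # noisy degree-2 target

X_tr, y_tr = X[:1500], y[:1500]
X_va, y_va = X[1500:], y[1500:]

def poly_features(X, k):
    """All monomials of total degree <= k (feasible only for small d and k)."""
    cols = [np.ones(len(X))]
    for deg in range(1, k + 1):
        for combo in combinations_with_replacement(range(X.shape[1]), deg):
            cols.append(np.prod(X[:, list(combo)], axis=1))
    return np.stack(cols, axis=1)

def holdout_risk(k):
    F_tr, F_va = poly_features(X_tr, k), poly_features(X_va, k)
    w, *_ = np.linalg.lstsq(F_tr, y_tr, rcond=None)     # degree-k least squares fit
    return np.mean((F_va @ w - y_va) ** 2)              # held-out regression risk

risks = [holdout_risk(k) for k in range(5)]             # candidate degrees 0..4
# Select the smallest degree whose risk is within 10% of the best risk.
k_hat = next(k for k, r in enumerate(risks) if r <= 1.1 * min(risks))
```

On this synthetic example the held-out risk drops sharply once the candidate degree reaches the true degree and then plateaus near the noise level, so the smallest near-optimal degree recovers k0 = 2; the paper's algorithm provides this recovery with a proof rather than a heuristic threshold.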