Abstract
We study the problem of learning a low-degree spherical polynomial of degree k_0 = \Theta(1) \ge 1 defined on the unit sphere in \RR^d by training an over-parameterized two-layer neural network with augmented features. Our main result is a significantly improved sample complexity for learning such low-degree polynomials. We show that, for any regression risk \eps \in (0, \Theta(d^{-k_0})], an over-parameterized two-layer neural network trained by a novel Gradient Descent with Projection (GDP) algorithm requires a sample complexity of n = \Theta(\log(4/\delta) \cdot d^{k_0}/\eps) with probability at least 1-\delta for any \delta \in (0,1), in contrast to the representative prior sample complexity of \Theta(d^{k_0} \max\set{\eps^{-2}, \log d}). Moreover, this sample complexity is nearly unimprovable, since the trained network attains a nonparametric regression risk of order \log(4/\delta) \cdot \Theta(d^{k_0}/n) with probability at least 1-\delta. Because the minimax optimal rate for regression with a kernel of rank \Theta(d^{k_0}) is \Theta(d^{k_0}/n), the nonparametric regression risk of the network trained by GDP is nearly minimax optimal. When the ground-truth degree k_0 is unknown, we present a novel and provable adaptive degree selection algorithm that identifies the true degree and achieves the same nearly optimal regression rate. To the best of our knowledge, this is the first time that a nearly optimal risk bound has been obtained, with algorithmic guarantees, by training an over-parameterized neural network with a popular activation function (ReLU) to learn low-degree spherical polynomials. Owing to the feature learning capability of GDP, our results go beyond the standard Neural Tangent Kernel (NTK) regime.