Abstract
In this paper, we study the problem of learning a low-degree spherical polynomial of degree \ell_0 = \Theta(1) \ge 1 defined on the unit sphere in \RR^d by training an over-parameterized two-layer neural network (NN) with channel attention. Our main result is a significantly improved sample complexity for learning such low-degree polynomials. We show that, for any regression risk \eps \in (0,1), a carefully designed two-layer NN with channel attention and finite width, trained by vanilla gradient descent (GD), requires a sample complexity of only n \asymp d^{\ell_0}/\eps with high probability, in contrast with the representative sample complexity of \Theta\pth{d^{\ell_0} \max\set{\eps^{-2},\log d}}, where n is the training data size. Moreover, this sample complexity cannot be improved, since the trained network attains a sharp nonparametric regression risk of order \Theta(d^{\ell_0}/n) with high probability. On the other hand, the minimax optimal rate for the regression risk with a kernel of rank \Theta(d^{\ell_0}) is \Theta(d^{\ell_0}/n), so the nonparametric regression risk of the GD-trained network is minimax optimal. Training the two-layer NN with channel attention proceeds in two stages: (1) a provably learnable channel selection algorithm, acting as a learnable harmonic-degree selection process, identifies the ground-truth channel number \ell_0 of the target function from the L \ge \ell_0 channels in the first-layer activation; (2) the second layer is trained by standard GD using the selected channels. To the best of our knowledge, this is the first time a minimax optimal risk bound has been obtained by training an over-parameterized but finite-width neural network with feature learning capability to learn low-degree spherical polynomials.
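As a rough illustration only (not the paper's construction), the following minimal Python sketch mimics the two-stage procedure on synthetic data: the per-channel feature map (a toy power nonlinearity standing in for the channel-attention activation), the selection threshold, and all dimensions are hypothetical assumptions for illustration.

```python
# Hypothetical sketch of the two-stage training described above.
# The feature map, selection rule, and all sizes are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, n, L, width = 20, 2000, 5, 256  # assumed sizes, not from the paper

# Synthetic data on the unit sphere with a degree-1 (low-degree) target.
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = X @ rng.standard_normal(d)  # stand-in for a low-degree spherical polynomial

# First layer: one random-feature block per channel; channel l uses a
# degree-(l+1) power nonlinearity as a toy proxy for harmonic degree.
W = rng.standard_normal((L, width, d)) / np.sqrt(d)
Phi = np.stack([(X @ W[l].T) ** (l + 1) for l in range(L)])  # (L, n, width)

# Stage 1 (schematic channel selection): score each channel by its average
# correlation with the labels and keep channels above a crude threshold.
scores = np.array([np.abs(Phi[l].T @ y).mean() for l in range(L)])
selected = np.where(scores > 0.5 * scores.max())[0]  # threshold is illustrative

# Stage 2: vanilla GD on the second-layer weights over the selected channels.
F = np.concatenate([Phi[l] for l in selected], axis=1)  # (n, |selected|*width)
v = np.zeros(F.shape[1])
lr = 1e-2 / n
for _ in range(500):
    v -= lr * F.T @ (F @ v - y)  # gradient step for squared loss

print("selected channels:", selected)
print("train MSE:", np.mean((F @ v - y) ** 2))
```

This sketch only conveys the flow (select channels, then run GD on the second layer); the paper's actual channel-attention architecture, selection guarantee, and step-size choices are developed in the later sections.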