Shallow Neural Networks Learn Low-Degree Spherical Polynomials with Feature Learning by Learnable Channel Attention

arXiv stat.ML / April 28, 2026


Key Points

  • The paper analyzes learning low-degree spherical polynomials on the unit sphere using an over-parameterized two-layer neural network equipped with learnable channel attention.
  • It proves an improved sample complexity bound: with high probability, a carefully designed finite-width network trained by vanilla gradient descent needs only n = Θ(d^{ℓ0}/ε) samples, compared with a representative prior bound of Θ(d^{ℓ0}·max{ε^{-2}, log d}); a numeric comparison appears after this list.
  • The authors show the bound is essentially tight by deriving a sharp rate of Θ(d^{ℓ0}/n) for the nonparametric regression risk, so the sample complexity cannot be improved further.
  • They further establish minimax optimality: the network’s regression risk rate matches the minimax optimal rate for kernel methods with a kernel of rank Θ(d^{ℓ0}).
  • Training proceeds in two stages: a provable learnable channel (harmonic-degree) selection step recovers the true degree ℓ0 from L ≥ ℓ0 channels in the first-layer activation, followed by standard GD training of the second layer on the selected channels.
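
To make the size of the improvement concrete, the following back-of-the-envelope comparison plugs hypothetical values (d = 1000, ℓ0 = 2, ε = 0.1, chosen only for illustration) into the two bounds, ignoring the constants hidden by Θ(·):

```python
import math

# Hypothetical values for illustration only; Theta(.) constants are ignored.
d, ell0, eps = 1000, 2, 0.1

n_new = d**ell0 / eps                            # paper: Theta(d^{ell0} / eps)
n_prior = d**ell0 * max(eps**-2, math.log(d))    # prior: Theta(d^{ell0} * max{eps^-2, log d})

print(f"new bound   ~ {n_new:.1e} samples")      # ~1.0e+07
print(f"prior bound ~ {n_prior:.1e} samples")    # ~1.0e+08
print(f"ratio       ~ {n_prior / n_new:.0f}x")   # ~10x, i.e. a factor of 1/eps
```

Whenever ε ≤ 1/√(log d), the max in the prior bound is attained by ε^{-2}, so the improvement is a factor of ε^{-1}, which grows as the target risk shrinks.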

Abstract

We study the problem of learning a low-degree spherical polynomial of degree ℓ0 = Θ(1) ≥ 1 on the unit sphere in ℝ^d by training an over-parameterized two-layer neural network (NN) with channel attention. Our main result is a significantly improved sample complexity for learning such low-degree polynomials. We show that, for any regression risk ε ∈ (0,1), a carefully designed two-layer NN with channel attention and finite width trained by vanilla gradient descent (GD) requires a sample complexity of only n = Θ(d^{ℓ0}/ε) with high probability, in contrast with the representative sample complexity Θ(d^{ℓ0}·max{ε^{-2}, log d}), where n is the training data size. Moreover, this sample complexity cannot be improved, since the trained network attains a sharp nonparametric regression risk rate of order Θ(d^{ℓ0}/n) with high probability. On the other hand, the minimax optimal rate for the regression risk with a kernel of rank Θ(d^{ℓ0}) is Θ(d^{ℓ0}/n), so the nonparametric regression risk rate of the network trained by GD is minimax optimal. Training the two-layer NN with channel attention proceeds in two stages: (1) a provable learnable channel selection algorithm, acting as a learnable harmonic-degree selection process, identifies the ground-truth channel number in the target function, ℓ0, from L ≥ ℓ0 channels in the first-layer activation; (2) the second layer is trained by standard GD using the selected channels. To the best of our knowledge, this is the first time a minimax optimal risk bound has been obtained by training an over-parameterized but finite-width neural network with feature-learning capability to learn low-degree spherical polynomials.
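
The abstract describes the two-stage procedure only at a high level. The sketch below is one minimal, hypothetical reading of it in NumPy: monomial-power channels stand in for the paper's harmonic-degree channels, softmax attention logits implement the learnable channel selection, and plain GD trains the second layer on the selected channel. All function names, the channel parameterization, and the toy data are illustrative assumptions, not the paper's actual construction.

```python
import numpy as np

rng = np.random.default_rng(0)

def channel_features(X, W, L):
    """First layer with L 'channels', one per candidate degree 1..L.
    Hypothetical stand-in: channel ell holds the ell-th powers of random
    projections; the paper's construction is built around spherical
    harmonics instead."""
    Z = X @ W.T                                          # (n, width)
    Phi = np.stack([Z**ell for ell in range(1, L + 1)])  # (L, n, width)
    # Standardize each feature so no channel dominates by scale alone.
    mu = Phi.mean(axis=1, keepdims=True)
    sd = Phi.std(axis=1, keepdims=True) + 1e-8
    return (Phi - mu) / sd

def stage1_select_channel(Phi, y, lr=0.1, steps=300):
    """Stage 1 (channel attention): jointly run GD on softmax attention
    logits over the L channels and on a shared second-layer vector;
    the dominant logit is read off as the selected harmonic degree."""
    L, n, width = Phi.shape
    alpha = np.zeros(L)                                  # attention logits
    v = rng.normal(size=width) / np.sqrt(width)
    for _ in range(steps):
        a = np.exp(alpha - alpha.max()); a /= a.sum()    # softmax
        r = np.einsum('l,lnw,w->n', a, Phi, v) - y       # residuals
        g_a = np.einsum('lnw,w,n->l', Phi, v, r) / n     # dLoss/da
        g_v = np.einsum('l,lnw,n->w', a, Phi, r) / n     # dLoss/dv
        alpha -= lr * a * (g_a - a @ g_a)                # softmax chain rule
        v -= lr * g_v
    return int(np.argmax(alpha))

def stage2_train(phi, y, steps=500):
    """Stage 2: vanilla GD for the second layer on the selected channel,
    with a step size set from the largest Hessian eigenvalue so the
    quadratic loss decreases monotonically."""
    n, width = phi.shape
    lr = 1.0 / np.linalg.eigvalsh(phi.T @ phi / n).max()
    v = np.zeros(width)
    for _ in range(steps):
        v -= lr * phi.T @ (phi @ v - y) / n
    return v

# Toy data: a degree-2 target on the unit sphere in R^d (ell_0 = 2).
d, n, width, L = 20, 2000, 200, 4
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
u = rng.normal(size=d); u /= np.linalg.norm(u)
y = d * (X @ u) ** 2                                     # scale target to O(1)
y -= y.mean()                                            # bias term omitted

W = rng.normal(size=(width, d))                          # random first-layer weights
Phi = channel_features(X, W, L)
k = stage1_select_channel(Phi, y)
v = stage2_train(Phi[k], y)
print("selected channel (degree):", k + 1)
print("train MSE:", float(np.mean((Phi[k] @ v - y) ** 2)))
```

Even in this toy form, the design point carries through: stage 1 is itself trained by gradient descent (this is the feature-learning step), and once the channel is fixed, stage 2 reduces to ordinary GD on a linear model over the selected channel's features.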