Shallow Neural Networks Learn Low-Degree Spherical Polynomials with Feature Learning by Learnable Channel Attention

arXiv stat.ML / April 28, 2026


Key Points

  • The paper analyzes learning low-degree spherical polynomials on the unit sphere using an over-parameterized two-layer neural network equipped with learnable channel attention.
  • It proves an improved sample complexity bound: with high probability, a carefully designed finite-width network trained by vanilla gradient descent needs only n = Θ(d^{ℓ0}/ε) samples, compared with a representative prior bound of Θ(d^{ℓ0}·max{ε^{-2}, log d}); a numeric comparison appears after this list.
  • The authors show the bound is essentially tight by deriving a sharp rate of Θ(d^{ℓ0}/n) for the nonparametric regression risk, so the sample complexity cannot be improved further.
  • They further establish minimax optimality: the network’s regression risk rate matches the minimax optimal rate for kernel methods with a kernel of rank Θ(d^{ℓ0}).
  • Training proceeds in two stages: a provable learnable channel (harmonic-degree) selection step recovers the true degree ℓ0 from L ≥ ℓ0 channels in the first-layer activation, followed by standard GD training of the second layer on the selected channels.
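
To make the size of the improvement concrete, the following back-of-the-envelope comparison plugs hypothetical values (d = 1000, ℓ0 = 2, ε = 0.1, chosen only for illustration) into the two bounds, ignoring the constants hidden by Θ(·):

```python
import math

# Hypothetical values for illustration only; Theta(.) constants are ignored.
d, ell0, eps = 1000, 2, 0.1

n_new = d**ell0 / eps                            # paper: Theta(d^{ell0} / eps)
n_prior = d**ell0 * max(eps**-2, math.log(d))    # prior: Theta(d^{ell0} * max{eps^-2, log d})

print(f"new bound   ~ {n_new:.1e} samples")      # ~1.0e+07
print(f"prior bound ~ {n_prior:.1e} samples")    # ~1.0e+08
print(f"ratio       ~ {n_prior / n_new:.0f}x")   # ~10x, i.e. a factor of 1/eps
```

Whenever ε ≤ 1/√(log d), the max in the prior bound is attained by ε^{-2}, so the improvement is a factor of ε^{-1}, which grows as the target risk shrinks.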

Abstract

We study the problem of learning a low-degree spherical polynomial of degree ℓ0 = Θ(1) ≥ 1 on the unit sphere in ℝ^d by training an over-parameterized two-layer neural network (NN) with channel attention. Our main result is a significantly improved sample complexity for learning such low-degree polynomials. We show that, for any regression risk ε ∈ (0,1), a carefully designed two-layer NN with channel attention and finite width trained by vanilla gradient descent (GD) requires a sample complexity of only n = Θ(d^{ℓ0}/ε) with high probability, in contrast with the representative sample complexity Θ(d^{ℓ0}·max{ε^{-2}, log d}), where n is the training data size. Moreover, this sample complexity cannot be improved, since the trained network attains a sharp nonparametric regression risk rate of order Θ(d^{ℓ0}/n) with high probability. On the other hand, the minimax optimal rate for the regression risk with a kernel of rank Θ(d^{ℓ0}) is Θ(d^{ℓ0}/n), so the nonparametric regression risk rate of the network trained by GD is minimax optimal. Training the two-layer NN with channel attention proceeds in two stages: (1) a provable learnable channel selection algorithm, acting as a learnable harmonic-degree selection process, identifies the ground-truth channel number in the target function, ℓ0, from L ≥ ℓ0 channels in the first-layer activation; (2) the second layer is trained by standard GD using the selected channels. To the best of our knowledge, this is the first time a minimax optimal risk bound has been obtained by training an over-parameterized but finite-width neural network with feature-learning capability to learn low-degree spherical polynomials.
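
The abstract describes the two-stage procedure only at a high level. The sketch below is one minimal, hypothetical reading of it in NumPy: monomial-power channels stand in for the paper's harmonic-degree channels, softmax attention logits implement the learnable channel selection, and plain GD trains the second layer on the selected channel. All function names, the channel parameterization, and the toy data are illustrative assumptions, not the paper's actual construction.

```python
import numpy as np

rng = np.random.default_rng(0)

def channel_features(X, W, L):
    """First layer with L 'channels', one per candidate degree 1..L.
    Hypothetical stand-in: channel ell holds the ell-th powers of random
    projections; the paper's construction is built around spherical
    harmonics instead."""
    Z = X @ W.T                                          # (n, width)
    Phi = np.stack([Z**ell for ell in range(1, L + 1)])  # (L, n, width)
    # Standardize each feature so no channel dominates by scale alone.
    mu = Phi.mean(axis=1, keepdims=True)
    sd = Phi.std(axis=1, keepdims=True) + 1e-8
    return (Phi - mu) / sd

def stage1_select_channel(Phi, y, lr=0.1, steps=300):
    """Stage 1 (channel attention): jointly run GD on softmax attention
    logits over the L channels and on a shared second-layer vector;
    the dominant logit is read off as the selected harmonic degree."""
    L, n, width = Phi.shape
    alpha = np.zeros(L)                                  # attention logits
    v = rng.normal(size=width) / np.sqrt(width)
    for _ in range(steps):
        a = np.exp(alpha - alpha.max()); a /= a.sum()    # softmax
        r = np.einsum('l,lnw,w->n', a, Phi, v) - y       # residuals
        g_a = np.einsum('lnw,w,n->l', Phi, v, r) / n     # dLoss/da
        g_v = np.einsum('l,lnw,n->w', a, Phi, r) / n     # dLoss/dv
        alpha -= lr * a * (g_a - a @ g_a)                # softmax chain rule
        v -= lr * g_v
    return int(np.argmax(alpha))

def stage2_train(phi, y, steps=500):
    """Stage 2: vanilla GD for the second layer on the selected channel,
    with a step size set from the largest Hessian eigenvalue so the
    quadratic loss decreases monotonically."""
    n, width = phi.shape
    lr = 1.0 / np.linalg.eigvalsh(phi.T @ phi / n).max()
    v = np.zeros(width)
    for _ in range(steps):
        v -= lr * phi.T @ (phi @ v - y) / n
    return v

# Toy data: a degree-2 target on the unit sphere in R^d (ell_0 = 2).
d, n, width, L = 20, 2000, 200, 4
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
u = rng.normal(size=d); u /= np.linalg.norm(u)
y = d * (X @ u) ** 2                                     # scale target to O(1)
y -= y.mean()                                            # bias term omitted

W = rng.normal(size=(width, d))                          # random first-layer weights
Phi = channel_features(X, W, L)
k = stage1_select_channel(Phi, y)
v = stage2_train(Phi[k], y)
print("selected channel (degree):", k + 1)
print("train MSE:", float(np.mean((Phi[k] @ v - y) ** 2)))
```

Even in this toy form, the design point carries through: stage 1 is itself trained by gradient descent (this is the feature-learning step), and once the channel is fixed, stage 2 reduces to ordinary GD on a linear model over the selected channel's features.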