Multigrade Neural Network Approximation

arXiv stat.ML / 4/3/2026


Key Points

  • The paper proposes multigrade deep learning (MGDL) as a structured training framework that refines approximation error hierarchically by adding residual blocks one “grade” at a time while freezing previously learned components.
  • It develops an operator-theoretic foundation and proves that for any continuous target function, there exists a fixed-width multigrade ReLU architecture whose residuals decrease strictly across grades and converge uniformly to zero.
  • The authors position grade-wise training as a way to retain the stability and optimization guarantees known for shallow one-hidden-layer ReLU models, guarantees that are difficult to achieve when training deep networks end to end.
  • Experiments are reported to illustrate the theoretical claims about vanishing approximation error and hierarchical error reduction.
  • Overall, the work claims a first rigorous theoretical guarantee (to the authors’ knowledge) that grade-wise training yields provable vanishing approximation error in deep networks.

Abstract

We study multigrade deep learning (MGDL) as a principled framework for structured error refinement in deep neural networks. While the approximation power of neural networks is now relatively well understood, training very deep architectures remains challenging due to highly non-convex and often ill-conditioned optimization landscapes. In contrast, for relatively shallow networks, most notably one-hidden-layer `ReLU` models, training admits convex reformulations with global guarantees, motivating learning paradigms that improve stability while scaling to depth. MGDL builds upon this insight by training deep networks grade by grade: previously learned grades are frozen, and each new residual block is trained solely to reduce the remaining approximation error, yielding an interpretable and stable hierarchical refinement process. We develop an operator-theoretic foundation for MGDL and prove that, for any continuous target function, there exists a fixed-width multigrade `ReLU` scheme whose residuals decrease strictly across grades and converge uniformly to zero. To the best of our knowledge, this work provides the first rigorous theoretical guarantee that grade-wise training yields provable vanishing approximation error in deep networks. Numerical experiments further illustrate the theoretical results.
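The grade-wise refinement loop described above can be illustrated with a deliberately simplified sketch: each "grade" is a one-hidden-layer ReLU model with frozen random hidden weights and a least-squares output layer, trained only on the residual left by the sum of all earlier (frozen) grades. This is not the paper's actual construction or architecture; the target function, widths, and number of grades below are illustrative assumptions chosen to make the residual-shrinking behavior visible.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative target function and training grid (not from the paper).
x = np.linspace(0.0, 1.0, 200).reshape(-1, 1)
y = np.sin(2.0 * np.pi * x).ravel()

def relu_features(z, W, b):
    """One-hidden-layer ReLU feature map with frozen hidden weights."""
    return np.maximum(z @ W + b, 0.0)

def train_grade(x, residual, width=32):
    """Fit one grade: random ReLU features plus a least-squares output layer.

    The hidden weights (W, b) are drawn once and frozen; only the linear
    output coefficients c are fit to the current residual.
    """
    W = rng.normal(size=(x.shape[1], width))
    b = rng.normal(size=width)
    H = relu_features(x, W, b)
    c, *_ = np.linalg.lstsq(H, residual, rcond=None)
    return lambda z: relu_features(z, W, b) @ c

# Grade-wise training: each new grade is fit solely to the remaining
# approximation error; previously learned grades stay frozen.
grades = []
residual = y.copy()
errors = []
for g in range(4):
    grade = train_grade(x, residual)
    grades.append(grade)
    residual = residual - grade(x)
    errors.append(np.linalg.norm(residual))

print(errors)
```

Because each grade's output layer is fit by least squares against the current residual (and the zero coefficient vector is always feasible), the training-set L2 residual can never increase from one grade to the next, which mimics the monotone error reduction the paper proves for its multigrade scheme.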