Geometric Layer-wise Approximation Rates for Deep Networks

arXiv cs.LG / 4/23/2026


Key Points

  • The paper tackles a gap in standard neural network approximation theory by giving a quantitative, scale-dependent interpretation of how depth improves approximation quality across intermediate layers.
  • It proposes a single shared, fixed-width mixed-activation architecture that supports any prescribed finite depth, ensuring that every intermediate readout \Phi_\ell is itself a valid approximant of the target function.
  • For targets in L^p([0,1]^d), the approximation error at layer \ell is bounded by (2d+1) times the L^p modulus of continuity evaluated at the geometric scale N^{-\ell}.
  • The bound specializes to the geometric convergence rate (2d+1)N^{-\ell} when the target function is 1-Lipschitz, and the design is motivated by multigrade deep learning with nested residual-style refinement.
  • The approach yields an adaptive, progressive refinement capability where later readouts reuse earlier correction terms, avoiding redesign of the network’s earlier parts.
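In the notation of the abstract, the layer-wise guarantee summarized above can be written as follows (here \omega_p denotes the L^p modulus of continuity; the symbol is a standard convention, not fixed by this summary):

```latex
\| f - \Phi_\ell \|_{L^p([0,1]^d)} \;\le\; (2d+1)\,\omega_p\!\left(f,\, N^{-\ell}\right),
\qquad \ell = 1, 2, \dots
```

For a 1-Lipschitz f one has \omega_p(f, N^{-\ell}) \le N^{-\ell}, which recovers the geometric rate (2d+1)N^{-\ell}.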

Abstract

Depth is widely viewed as a central contributor to the success of deep neural networks, whereas standard neural network approximation theory typically provides guarantees only for the final output and leaves the role of intermediate layers largely unclear. We address this gap by developing a quantitative framework in which depth admits a precise scale-dependent interpretation. Specifically, we design a single shared mixed-activation architecture of fixed width 2dN+d+2 and any prescribed finite depth such that each intermediate readout \Phi_\ell is itself an approximant to the target function f. For f\in L^p([0,1]^d) with p\in [1,\infty), the approximation error of \Phi_\ell is controlled by (2d+1) times the L^p modulus of continuity at the geometric scale N^{-\ell} for all \ell. The estimate reduces to the geometric rate (2d+1)N^{-\ell} if f is 1-Lipschitz. Our network design is inspired by multigrade deep learning, where depth serves as a progressive refinement mechanism: each new correction targets residual information at a finer scale while the earlier correction terms remain part of the later readouts, yielding a nested architecture that supports adaptive refinement without redesigning the preceding network.
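The progressive-refinement mechanism can be illustrated with a toy one-dimensional sketch. This is not the paper's mixed-activation network: the midpoint piecewise-constant approximant and the helper names below are illustrative assumptions. What it mirrors is the nesting: each readout is a running sum of corrections, the level-\ell correction absorbs the residual at the finer scale N^{-\ell}, and all earlier corrections remain part of every later readout.

```python
import numpy as np

def nested_readouts(f, max_level, N=2, samples=4096):
    """Toy multiscale refinement on [0,1] (illustrative, not the paper's construction).

    The level-ell readout is the running sum of all corrections up to ell;
    each new correction targets the residual at the finer scale N**(-ell),
    so earlier terms are reused rather than redesigned.
    """
    x = np.linspace(0.0, 1.0, samples, endpoint=False)
    approx = np.zeros_like(x)
    readouts = []
    for ell in range(1, max_level + 1):
        # midpoint piecewise-constant proxy for the scale-N^{-ell} readout
        centers = (np.floor(x * N**ell) + 0.5) / N**ell
        correction = f(centers) - approx   # residual information at the finer scale
        approx = approx + correction       # nested: earlier corrections remain
        readouts.append(approx.copy())
    return x, readouts

f = lambda t: np.abs(t - 0.37)             # a 1-Lipschitz target on [0,1]
x, readouts = nested_readouts(f, max_level=5)
errs = [np.abs(f(x) - r).max() for r in readouts]
```

For a 1-Lipschitz target the sup error of the level-\ell readout is at most half the cell width N^{-\ell}, so the errors shrink geometrically in \ell, mirroring (up to the (2d+1) constant) the rate stated in the abstract.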