Floating-Point Networks with Automatic Differentiation Can Represent Almost All Floating-Point Functions and Their Gradients

arXiv cs.LG / 5/5/2026


Key Points

  • The paper studies whether the classic universal approximation results for differentiable functions and their gradients still hold when neural networks are evaluated in actual floating-point arithmetic, with round-off error.
  • It proves that given a floating-point function φ (such as a loss function) and any desired floating-point-valued outputs and input gradients, there exists a floating-point neural network f that represents those outputs while the automatically differentiated quantity D^AD(φ∘f) represents those gradients (a minimal sketch of this quantity follows the list).
  • The authors extend the result to multiple functions φ1,…,φn, showing that, under mild assumptions, D^AD(φi∘f) can simultaneously realize arbitrary gradients for each i while f represents the target function values.
  • The theoretical guarantees apply to commonly used practical activation functions, including ReLU, ELU, GeLU, Swish, Sigmoid, and tanh.
  • Overall, the work provides a floating-point-and-automatic-differentiation analogue of universal approximation for both function values and gradients, bridging a gap between idealized real-parameter theory and implementable numerical computation.
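
To make the quantity in these statements concrete, the following is a minimal sketch, not the paper's construction: a toy float32 ReLU network f, a stand-in loss φ, and the input gradient D^AD(φ∘f) obtained by reverse-mode automatic differentiation in JAX. The network shape, random weights, and choice of φ are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the paper's construction) of D^AD(φ∘f):
# the input gradient that reverse-mode AD returns for a loss φ composed with
# a network f, with every operation carried out in float32.
import jax
import jax.numpy as jnp

def f(x, params):
    """Toy one-hidden-layer ReLU network; shapes and weights are illustrative."""
    W1, b1, W2, b2 = params
    h = jnp.maximum(W1 @ x + b1, 0.0)  # ReLU, evaluated in float32
    return W2 @ h + b2

def phi(y):
    """Stand-in floating-point loss (squared error against zero)."""
    return jnp.sum(y ** 2)

key = jax.random.PRNGKey(0)
k1, k2 = jax.random.split(key)
params = (
    jax.random.normal(k1, (4, 3), dtype=jnp.float32),  # W1
    jnp.zeros(4, dtype=jnp.float32),                    # b1
    jax.random.normal(k2, (2, 4), dtype=jnp.float32),   # W2
    jnp.zeros(2, dtype=jnp.float32),                    # b2
)

x = jnp.array([0.1, -0.3, 0.7], dtype=jnp.float32)

value = phi(f(x, params))                            # f's output fed into φ, in float32
grad_x = jax.grad(lambda x_: phi(f(x_, params)))(x)  # D^AD(φ∘f)(x): input gradient via AD
print(value, grad_x)
```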

Abstract

Theoretical studies show that for any differentiable function on a compact domain, there exists a neural network that approximates both the function values and gradients. However, such a result cannot be used in practice since it assumes real parameters and exact internal operations. In contrast, real implementations only use a finite subset of reals and machine operations with round-off errors. In this work, we investigate whether a similar result holds for neural networks under floating-point arithmetic, when the gradient with respect to the input is computed by the automatic differentiation algorithm D^AD. We first show that given a floating-point function φ (e.g., a loss function), arbitrary function values and gradients can be represented by a floating-point network f and D^AD(φ∘f), respectively. We further extend this result: given φ1,…,φn, D^AD(φi∘f) can simultaneously represent arbitrary gradients while f represents the target values, under mild conditions. Our results hold for practical activation functions, e.g., ReLU, ELU, GeLU, Swish, Sigmoid, and tanh.
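
The motivation above, that implementations only use a finite subset of the reals together with machine operations that round, can be seen in a one-line float32 check; the example below is an illustration of this point, not material from the paper.

```python
# Small illustration (not from the paper) of float32 round-off: floats are a
# finite set of values, and arithmetic rounds its results.
import jax.numpy as jnp

one = jnp.float32(1.0)
tiny = jnp.float32(1e-8)           # smaller than half the gap above 1.0 in float32

print(one + tiny == one)           # True: the sum rounds back to exactly 1.0
print(jnp.finfo(jnp.float32).eps)  # ~1.1920929e-07, the gap from 1.0 to the next float
```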