Abstract
Theoretical studies show that, for any differentiable function on a compact domain, there exists a neural network that approximates both the function values and the gradients. However, such a result cannot be applied in practice, since it assumes real-valued parameters and exact internal operations; actual implementations use only a finite subset of the reals and machine operations that incur round-off errors. In this work, we investigate whether a similar result holds for neural networks under floating-point arithmetic, when the gradient with respect to the input is computed by the automatic differentiation algorithm D^\mathtt{AD}. We first show that, given a floating-point function \phi (e.g., a loss function), arbitrary function values and arbitrary gradients can be represented by a floating-point network f and by D^\mathtt{AD}(\phi\circ f), respectively. We then extend this result: given \phi_1,\dots,\phi_n, the maps D^\mathtt{AD}(\phi_1\circ f),\dots,D^\mathtt{AD}(\phi_n\circ f) can simultaneously represent arbitrary gradients while f represents the target values, under mild conditions. Our results hold for practical activation functions such as \mathrm{ReLU}, \mathrm{ELU}, \mathrm{GeLU}, \mathrm{Swish}, \mathrm{Sigmoid}, and \mathrm{tanh}.
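The following is a minimal sketch, not taken from the paper, of the setting the abstract describes: a floating-point network f, a scalar function \phi, and the input gradient D^\mathtt{AD}(\phi\circ f) computed by automatic differentiation in float32. The particular weights, the choice \phi(y) = y^2, and the use of float64 as a stand-in for exact real arithmetic are all illustrative assumptions.

```python
# A minimal sketch (not from the paper): a float32 network f, a scalar
# function phi, and the input gradient D^AD(phi ∘ f) obtained by automatic
# differentiation. Weights and phi are hypothetical, chosen for illustration;
# float64 serves only as a proxy for exact real arithmetic.
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)  # enable float64 for the comparison

def f(x, dtype):
    # A one-hidden-layer ReLU network with fixed (hypothetical) weights.
    W1 = jnp.asarray([[1.0, -2.0], [0.5, 3.0]], dtype=dtype)
    b1 = jnp.asarray([0.1, -0.2], dtype=dtype)
    w2 = jnp.asarray([1.0, -1.0], dtype=dtype)
    return w2 @ jax.nn.relu(W1 @ x + b1)

def phi(y):
    # A scalar function applied to the network output (e.g., a loss).
    return y ** 2

x = jnp.asarray([0.3, -0.7], dtype=jnp.float32)

# D^AD(phi ∘ f): gradient with respect to the *input*, evaluated in float32.
g32 = jax.grad(lambda x: phi(f(x, jnp.float32)))(x)
# The same gradient in float64, standing in for exact arithmetic.
g64 = jax.grad(lambda x: phi(f(x, jnp.float64)))(x.astype(jnp.float64))

print("float32 AD gradient:", g32)
print("float64 AD gradient:", g64)
print("round-off gap:", jnp.max(jnp.abs(g32.astype(jnp.float64) - g64)))
```

The gap printed at the end illustrates the phenomenon the paper studies: under floating-point arithmetic, the gradient produced by D^\mathtt{AD} is not the exact real-arithmetic gradient, so classical universal-approximation results for values and gradients do not transfer directly.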