Deep Learning using Rectified Linear Units (ReLU)

arXiv stat.ML / 4/15/2026


Key Points

  • The paper corrects a citation record by tracing the mathematical lineage of piecewise linear activations and reaffirming that the definitive deep-learning integration of ReLU is attributed to Nair & Hinton (2010), not the 2018 initial version of this paper, which only studied ReLU at the classification layer.
  • It provides a robust empirical comparison of ReLU, Tanh, and Sigmoid across image classification, text classification, and image reconstruction tasks using 10 independent randomized trials and Kruskal–Wallis statistical testing.
  • The results support the theoretical limitations of saturating activations: Sigmoid fails to converge in deep convolutional vision settings due to the vanishing-gradient problem, performing at chance-level accuracy.
  • ReLU and Tanh both show stable convergence, with ReLU achieving the highest mean accuracy and F1-score for classification tasks, while Tanh delivers the best peak signal-to-noise ratio for reconstruction.
  • Overall, the study concludes there is a statistically significant performance gap between activation functions in deep networks and reinforces the practical need for non-saturating activations in deep architectures.
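The saturation argument behind these findings can be illustrated numerically. The sketch below (not from the paper) compares the derivatives of Sigmoid, Tanh, and ReLU at growing input magnitudes: the saturating functions' gradients decay toward zero, and since backpropagation multiplies per-layer gradients, a Sigmoid gradient bounded by 0.25 shrinks the signal at least geometrically with depth.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def d_sigmoid(x):
    # Derivative peaks at 0.25 (x = 0) and saturates toward 0 as |x| grows,
    # so a chain of L sigmoid layers scales gradients by at most 0.25**L.
    s = sigmoid(x)
    return s * (1.0 - s)

def d_tanh(x):
    # Also saturating, but its peak derivative is 1.0, which is why Tanh
    # converges where Sigmoid stalls.
    return 1.0 - math.tanh(x) ** 2

def d_relu(x):
    # Non-saturating for positive inputs: the gradient stays exactly 1.
    return 1.0 if x > 0 else 0.0

for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x={x:5.1f}  d_sigmoid={d_sigmoid(x):.6f}  "
          f"d_tanh={d_tanh(x):.6f}  d_relu={d_relu(x):.1f}")
```

Running this shows `d_sigmoid` dropping from 0.25 to roughly 4.5e-5 between x = 0 and x = 10, while `d_relu` remains 1.0 throughout, which is the mechanism behind Sigmoid's chance-level accuracy in the deep convolutional experiments.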

Abstract

The Rectified Linear Unit (ReLU) is a foundational activation function in artificial neural networks. Recent literature frequently misattributes its origin to the 2018 (initial) version of this paper, which exclusively investigated ReLU at the classification layer. This paper formally corrects the citation record by tracing the mathematical lineage of piecewise linear functions from early biological models to their definitive integration into deep learning by Nair & Hinton (2010). Alongside this historical rectification, we present a comprehensive empirical comparison of the ReLU, Hyperbolic Tangent (Tanh), and Logistic (Sigmoid) activation functions across image classification, text classification, and image reconstruction tasks. To ensure statistical robustness, we evaluated these functions using 10 independent randomized trials and assessed significance using the non-parametric Kruskal–Wallis H test. The empirical data validate the theoretical limitations of saturating functions. Sigmoid failed to converge in deep convolutional vision tasks due to the vanishing-gradient problem, thus yielding chance-level accuracy. Conversely, ReLU and Tanh exhibited stable convergence. ReLU achieved the highest mean accuracy and F1-score on image classification and text classification tasks, while Tanh yielded the highest peak signal-to-noise ratio in image reconstruction. Ultimately, this study confirms a statistically significant performance variance among activations, thus reaffirming the necessity of non-saturating functions in deep architectures, and restores proper historical attribution to prior literature.
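The paper's significance testing pools the 10 per-activation trial scores and applies the Kruskal–Wallis H test, which compares rank distributions across groups without assuming normality. A minimal sketch of that analysis using SciPy's `scipy.stats.kruskal` follows; the accuracy values are illustrative placeholders, not the paper's reported numbers.

```python
from scipy.stats import kruskal

# Illustrative per-trial test accuracies (NOT the paper's data):
# 10 independent randomized trials per activation function.
relu_acc    = [0.91, 0.92, 0.90, 0.93, 0.91, 0.92, 0.90, 0.92, 0.91, 0.93]
tanh_acc    = [0.88, 0.89, 0.87, 0.90, 0.88, 0.89, 0.87, 0.88, 0.89, 0.90]
sigmoid_acc = [0.10, 0.11, 0.09, 0.10, 0.10, 0.11, 0.10, 0.09, 0.10, 0.11]

# Kruskal-Wallis H test: a non-parametric, rank-based alternative to
# one-way ANOVA, appropriate when scores may not be normally distributed.
h_stat, p_value = kruskal(relu_acc, tanh_acc, sigmoid_acc)
print(f"H = {h_stat:.2f}, p = {p_value:.4g}")
if p_value < 0.05:
    print("Reject H0: at least one activation's score distribution differs.")
```

With group medians this far apart, the test rejects the null hypothesis of identical distributions, mirroring the paper's conclusion of a statistically significant performance gap among activations.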