Geometric Monomial (GEM): a family of rational 2N-differentiable activation functions

arXiv cs.LG · April 24, 2026


Key Points

  • The paper introduces the Geometric Monomial (GEM) family of rational, highly smooth activation functions that are C^{2N}-differentiable and aim to match ReLU-like optimization behavior using purely rational arithmetic.
  • Three variants are proposed—GEM (base), E-GEM (epsilon-parameterized to enable arbitrary L^p approximation of ReLU), and SE-GEM (piecewise form designed to remove dead neurons while maintaining C^{2N} junction smoothness).
  • An N-ablation study finds N=1 optimal for standard-depth networks, and the preferred smoothness order depends on architecture: deep CNNs favor N=1, while transformers favor N=2.
  • GEM-family activations improve performance on multiple benchmarks: SE-GEM beats GELU on CIFAR-10 with ResNet-56 (92.51% vs 92.44%), E-GEM shrinks the GELU deficit on CIFAR-100 + ResNet-56 to just 0.62%, and GEM achieves the best GPT-2 perplexity (72.57 vs 73.76 for GELU).
  • The epsilon parameter in E-GEM shows a scale-dependent optimum: small values work best for deep CNNs and larger transformers, while large values benefit smaller, shallower transformers such as BERT-small (best validation loss of 6.656 at epsilon=10).

Abstract

The choice of activation function plays a crucial role in the optimization and performance of deep neural networks. While the Rectified Linear Unit (ReLU) remains the dominant choice due to its simplicity and effectiveness, its lack of smoothness may hinder gradient-based optimization in deep architectures. In this work we propose a family of C^{2N}-smooth activation functions whose gate follows a log-logistic CDF, achieving ReLU-like performance with purely rational arithmetic. We introduce three variants: GEM (the base family), E-GEM (an \epsilon-parameterized generalization enabling arbitrary L^p-approximation of ReLU), and SE-GEM (a piecewise variant eliminating dead neurons with C^{2N} junction smoothness). An N-ablation study establishes N=1 as optimal for standard-depth networks, reducing the GELU deficit on CIFAR-100 + ResNet-56 from 6.10% to 2.12%. The smoothness parameter N further reveals a CNN-transformer tradeoff: N=1 is preferred for deep CNNs, while N=2 is preferred for transformers. On MNIST, E-GEM ties the best baseline (99.23%). On CIFAR-10 + ResNet-56, SE-GEM (\epsilon=10^{-4}) surpasses GELU (92.51% vs 92.44%) -- the first GEM-family activation to outperform GELU. On CIFAR-100 + ResNet-56, E-GEM reduces the GELU deficit from 6.10% (GEM N=2) to just 0.62%. On GPT-2 (124M), GEM achieves the lowest perplexity (72.57 vs 73.76 for GELU), with GEM N=1 also beating GELU (73.32). On BERT-small, E-GEM (\epsilon=10) achieves the best validation loss (6.656) across all activations. The \epsilon-parameterization reveals a scale-dependent optimum: small \epsilon (10^{-4}--10^{-6}) for deep CNNs and larger transformers, with the special case of small transformers (BERT-small) benefiting from large \epsilon (\epsilon=10) due to its limited depth and unconstrained gradients.
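The summary above does not reproduce the paper's exact formulas, but the abstract's description (a gate following a log-logistic CDF, purely rational arithmetic, C^{2N} junction smoothness, and an \epsilon-parameterization that approximates ReLU) suggests a sketch like the following. Everything here is a reconstruction from that description, not the paper's definition; the function names `gem` and `e_gem` and the specific gate formula are assumptions.

```python
def gem(x: float, N: int = 1) -> float:
    """GEM-style gated activation (assumed form, reconstructed from the abstract).

    The gate x^{2N} / (1 + x^{2N}) is the log-logistic CDF with shape 2N,
    computable with purely rational arithmetic. Near 0 the output behaves
    like x^{2N+1}, so derivatives up to order 2N vanish at the junction
    (C^{2N} smoothness); for large positive x it approaches ReLU's identity branch.
    """
    if x <= 0.0:
        return 0.0
    p = x ** (2 * N)
    return x * p / (1.0 + p)


def e_gem(x: float, eps: float = 1e-4, N: int = 1) -> float:
    """Hypothetical epsilon-parameterization: rescaling the gate's input by
    1/eps sharpens it, so e_gem approaches ReLU pointwise (and in L^p on
    bounded intervals) as eps -> 0. The paper's exact E-GEM construction
    may differ.
    """
    if x <= 0.0:
        return 0.0
    p = (x / eps) ** (2 * N)
    return x * p / (1.0 + p)


if __name__ == "__main__":
    # gem(2, N=1) = 2 * (4 / 5) = 1.6; shrinking eps pulls e_gem toward ReLU.
    print(gem(2.0), e_gem(1.0, eps=1.0), e_gem(1.0, eps=1e-4))
```

Under this reading, raising N buys extra junction smoothness at the cost of a flatter gate near zero, which is consistent with the paper's ablation: N=1 suffices for deep CNNs, while transformers prefer N=2.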