AI Navigate

DAPA: Distribution Aware Piecewise Activation Functions for On-Device Transformer Inference and Training

arXiv cs.LG / 3/23/2026


Key Points

  • We propose Distribution-Aware Piecewise Activation (DAPA), a differentiable, hardware-friendly activation for Transformer models on edge devices that leverages the distribution of pre-activation data.
  • DAPA uses a non-uniform piecewise approximation with finer segments in high-probability regions to improve generalization over prior piecewise-linear methods.
  • It is quantized using Distribution-Weighted Mean Square Error to reduce latency and resource usage for hardware deployment.
  • An HLS implementation shows that DAPA speeds up GELU computation by 16x and cuts DSP utilization by 16x, while maintaining or improving performance on vision Transformers and GPT-2.
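The core idea in the key points above, allocating finer piecewise segments where pre-activations are most likely, can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: it assumes a standard-normal pre-activation distribution and places piecewise-linear breakpoints for GELU at equal-probability quantiles, so the knots cluster near zero and thin out in the tails.

```python
# Hypothetical sketch: piecewise-linear GELU with breakpoints at quantiles of
# an assumed N(0, 1) pre-activation distribution, so high-probability regions
# get finer segments. All names and the distribution choice are illustrative.
import math

def gelu(x):
    # Exact GELU via the Gaussian CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def normal_quantile(p):
    # Inverse CDF of N(0, 1) via bisection (accurate enough for a sketch).
    lo, hi = -10.0, 10.0
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if 0.5 * (1.0 + math.erf(mid / math.sqrt(2.0))) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def make_breakpoints(num_knots):
    # Equal-probability quantiles -> dense knots near 0, sparse in the tails.
    ps = [(i + 1) / (num_knots + 1) for i in range(num_knots)]
    return [normal_quantile(p) for p in ps]

def piecewise_gelu(x, knots):
    # Linear interpolation between exact GELU values at the knots;
    # inputs outside the knot range follow the boundary segments.
    if x <= knots[0]:
        x0, x1 = knots[0], knots[1]
    elif x >= knots[-1]:
        x0, x1 = knots[-2], knots[-1]
    else:
        for i in range(len(knots) - 1):
            if knots[i] <= x <= knots[i + 1]:
                x0, x1 = knots[i], knots[i + 1]
                break
    y0, y1 = gelu(x0), gelu(x1)
    t = (x - x0) / (x1 - x0)
    return y0 + t * (y1 - y0)
```

With 17 knots placed this way, the approximation error near zero, where most pre-activations fall, is far smaller than a uniform grid of the same size would give, which is the intuition behind the generalization claim.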

Abstract

Non-linear activation functions play a pivotal role in on-device inference and training, as they not only consume substantial hardware resources but also have a significant impact on system performance and energy efficiency. In this work, we propose Distribution-Aware Piecewise Activation (DAPA), a differentiable and hardware-friendly activation function for Transformer architectures that exploits the distribution of pre-activation data. DAPA employs a non-uniform piecewise approximation that allocates finer segments to high-probability regions of the distribution, improving generalizability over prior piecewise-linear methods. The resulting approximation is further quantized using Distribution-Weighted Mean Square Error to reduce latency and resource utilization for hardware deployment. Our HLS implementation demonstrates that DAPA speeds up GELU computation by 16× and decreases DSP utilization by 16× while maintaining comparable or better performance across vision Transformers and GPT-2 models.
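The Distribution-Weighted Mean Square Error criterion mentioned in the abstract can be sketched in the same spirit: quantization error is weighted by how likely each input is, so accuracy is preserved where pre-activations concentrate. The sketch below is an assumption-laden illustration, not the paper's method: it assumes an N(0, 1) pre-activation density, a signed 8-bit fixed-point format, and a simple grid search over candidate scales.

```python
# Hypothetical sketch of a Distribution-Weighted MSE criterion: choose a
# fixed-point scale for GELU outputs by minimizing squared quantization error
# weighted by an assumed N(0, 1) pre-activation density. The 8-bit format and
# the candidate-scale grid are illustrative assumptions, not from the paper.
import math

def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def dw_mse(scale, xs, bits=8):
    # Weighted MSE of quantizing gelu(x) to signed `bits`-bit integers with
    # step `scale`; each sample's error is weighted by the density of x.
    qmax = 2 ** (bits - 1) - 1
    num = den = 0.0
    for x in xs:
        y = gelu(x)
        q = max(-qmax - 1, min(qmax, round(y / scale)))
        num += normal_pdf(x) * (y - q * scale) ** 2
        den += normal_pdf(x)
    return num / den

def best_scale(xs, candidates):
    # Grid search: DW-MSE favors scales that are accurate where
    # pre-activations are most probable, tolerating error in the tails.
    return min(candidates, key=lambda s: dw_mse(s, xs))

xs = [i / 100.0 for i in range(-400, 401)]        # sample grid over [-4, 4]
candidates = [k / 1000.0 for k in range(5, 101)]  # scales 0.005 .. 0.100
scale = best_scale(xs, candidates)
```

Because the weighting discounts the tails, the selected scale can be finer than the unweighted worst-case range would allow, trading rare clipping error for better resolution in the high-probability region.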