BWTA: Accurate and Efficient Binarized Transformer by Algorithm-Hardware Co-design

arXiv cs.LG / 4/7/2026


Key Points

  • The paper proposes BWTA (Binary Weights & Ternary Activations), an ultra-low-bit Transformer quantization scheme that projects tiny activation values to zero, reducing zero-point distortion and preserving accuracy at extremely low bit-widths.
  • It introduces Smooth Multi-Stage Quantization for training stability and fast convergence, combining levelwise degradation and a magnitude-alignment projection factor.
  • For inference, the authors develop a custom BWTA MatMul CUDA kernel with efficient bit-packing and binary/ternary implementations that target both linear and attention operators across Transformer architectures.
  • Reported results indicate near full-precision performance for BERT (with small GLUE drops) and competitive perplexity/accuracy for LLMs, while delivering large speedups (e.g., a 16–24× kernel-level speedup over FP16) and higher end-to-end prefill throughput.
  • Overall, the work demonstrates algorithm–hardware co-design for low-latency ultra-low-bit Transformer inference without major quality loss.
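
As a rough illustration of the quantization idea summarized above, the sketch below binarizes weights to {-1, +1} and ternarizes activations to {-1, 0, +1}, projecting small-magnitude activations to zero. The function names, the fixed threshold, and the mean-magnitude scaling are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def binarize_weights(w):
    """Binarize weights to {-1, +1} with a per-tensor scale.
    Mean absolute value is a common (assumed) choice of scale."""
    alpha = np.abs(w).mean()
    return alpha * np.sign(w)

def ternarize_activations(x, delta=0.05):
    """Ternarize activations to {-1, 0, +1}: entries with magnitude
    below the threshold are projected to zero, which is the intuition
    behind avoiding zero-point distortion in pure binarization."""
    q = np.where(np.abs(x) < delta, 0.0, np.sign(x))
    beta = np.abs(x[q != 0]).mean() if np.any(q != 0) else 1.0
    return beta * q
```

In a real pipeline these quantizers would be applied inside linear and attention layers with a straight-through estimator for gradients; the sketch only shows the forward mapping.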

Abstract

Ultra-low-bit quantization brings substantial efficiency gains for Transformer-based models, but accuracy degradation and limited GPU support hinder its wide usage. In this paper, we analyze zero-point distortion in binarization and propose a Binary Weights & Ternary Activations (BWTA) quantization scheme, which projects tiny values to zero and preserves the accuracy of extremely low-bit models. For training, we propose Smooth Multi-Stage Quantization, combining a Levelwise Degradation Strategy and a Magnitude-Alignment Projection Factor to enable stable and fast convergence. For inference, we develop a BWTA MatMul CUDA kernel with instruction-level parallel bit-packing and comprehensive binary/ternary MatMul implementations for both linear and attention operators, allowing seamless integration across Transformer architectures. Experiments show that BWTA approaches full-precision performance for BERT, with an average 3.5% drop on GLUE and less than a 2% drop on five tasks, and achieves comparable perplexity and accuracy for LLMs. In efficiency, it delivers a 16 to 24 times kernel-level speedup over FP16 on NVIDIA GPUs, and 216 to 330 tokens/s of end-to-end prefill throughput with a lower memory footprint on LLMs. As an algorithm-hardware co-design, BWTA demonstrates practical, low-latency ultra-low-bit inference without sacrificing model quality.
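
The speedups come from replacing floating-point multiply-accumulates with bitwise operations on packed words. The paper does this at the instruction level in a CUDA kernel; the pure-Python sketch below (all function names are hypothetical) only shows the underlying arithmetic identity: a ±1 dot product reduces to XOR plus popcount, and a ternary activation vector can be represented as a sign bitmask plus a nonzero mask:

```python
def pack_sign(v):
    """Pack the signs of a vector into a bitmask: bit i set when v[i] > 0."""
    bits = 0
    for i, x in enumerate(v):
        if x > 0:
            bits |= 1 << i
    return bits

def pack_nonzero(v):
    """Pack the support of a vector: bit i set when v[i] != 0."""
    bits = 0
    for i, x in enumerate(v):
        if x != 0:
            bits |= 1 << i
    return bits

def binary_dot(a_bits, b_bits, n):
    """Dot product of two packed ±1 vectors of length n:
    equal bits contribute +1, differing bits -1, so
    dot = n - 2 * popcount(a XOR b)."""
    return n - 2 * bin(a_bits ^ b_bits).count("1")

def ternary_dot(w_bits, x_sign_bits, x_mask):
    """Dot of packed ±1 weights with a {-1, 0, +1} activation vector:
    zero activations are excluded via the mask before the XOR/popcount."""
    active = bin(x_mask).count("1")
    return active - 2 * bin((w_bits ^ x_sign_bits) & x_mask).count("1")
```

On a GPU the same identity maps to 32- or 64-bit words processed with XOR and the `__popc` intrinsic, which is how kernel-level speedups of this magnitude over FP16 become possible; scaling factors are applied once per output element afterward.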