BWTA: Accurate and Efficient Binarized Transformer by Algorithm-Hardware Co-design

arXiv cs.LG / 4/7/2026


Key Points

  • The paper proposes BWTA (Binary Weights & Ternary Activations), an ultra-low-bit Transformer quantization scheme that projects tiny activation values to zero, reducing zero-point distortion and preserving accuracy at extremely low bit-widths.
  • It introduces Smooth Multi-Stage Quantization for training stability and fast convergence, combining levelwise degradation and a magnitude-alignment projection factor.
  • For inference, the authors develop a custom BWTA MatMul CUDA kernel with efficient bit-packing and binary/ternary implementations that target both linear and attention operators across Transformer architectures.
  • Reported results indicate near full-precision performance for BERT (with small GLUE drops) and competitive perplexity/accuracy for LLMs, while delivering large speedups (e.g., a 16–24× kernel-level speedup over FP16) and higher end-to-end prefill throughput.
  • Overall, the work demonstrates algorithm–hardware co-design for low-latency ultra-low-bit Transformer inference without major quality loss.
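
As a rough illustration of the quantization idea summarized above, the sketch below binarizes weights to {-1, +1} and ternarizes activations to {-1, 0, +1}, projecting small-magnitude activations to zero. The function names, the fixed threshold, and the mean-magnitude scaling are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def binarize_weights(w):
    """Binarize weights to {-1, +1} with a per-tensor scale.
    Mean absolute value is a common (assumed) choice of scale."""
    alpha = np.abs(w).mean()
    return alpha * np.sign(w)

def ternarize_activations(x, delta=0.05):
    """Ternarize activations to {-1, 0, +1}: entries with magnitude
    below the threshold are projected to zero, which is the intuition
    behind avoiding zero-point distortion in pure binarization."""
    q = np.where(np.abs(x) < delta, 0.0, np.sign(x))
    beta = np.abs(x[q != 0]).mean() if np.any(q != 0) else 1.0
    return beta * q
```

In a real pipeline these quantizers would be applied inside linear and attention layers with a straight-through estimator for gradients; the sketch only shows the forward mapping.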

Abstract

Ultra-low-bit quantization brings substantial efficiency gains for Transformer-based models, but accuracy degradation and limited GPU support hinder its wide usage. In this paper, we analyze zero-point distortion in binarization and propose a Binary Weights & Ternary Activations (BWTA) quantization scheme, which projects tiny values to zero and preserves the accuracy of extremely low-bit models. For training, we propose Smooth Multi-Stage Quantization, combining a Levelwise Degradation Strategy and a Magnitude-Alignment Projection Factor to enable stable and fast convergence. For inference, we develop a BWTA MatMul CUDA kernel with instruction-level parallel bit-packing and comprehensive binary/ternary MatMul implementations for both linear and attention operators, allowing seamless integration across Transformer architectures. Experiments show that BWTA approaches full-precision performance for BERT, with an average 3.5% drop on GLUE and less than a 2% drop on five tasks, and achieves comparable perplexity and accuracy for LLMs. In efficiency, it delivers a 16 to 24 times kernel-level speedup over FP16 on NVIDIA GPUs, and 216 to 330 tokens/s of end-to-end prefill throughput with a lower memory footprint on LLMs. As an algorithm-hardware co-design, BWTA demonstrates practical, low-latency ultra-low-bit inference without sacrificing model quality.
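
The speedups come from replacing floating-point multiply-accumulates with bitwise operations on packed words. The paper does this at the instruction level in a CUDA kernel; the pure-Python sketch below (all function names are hypothetical) only shows the underlying arithmetic identity: a ±1 dot product reduces to XOR plus popcount, and a ternary activation vector can be represented as a sign bitmask plus a nonzero mask:

```python
def pack_sign(v):
    """Pack the signs of a vector into a bitmask: bit i set when v[i] > 0."""
    bits = 0
    for i, x in enumerate(v):
        if x > 0:
            bits |= 1 << i
    return bits

def pack_nonzero(v):
    """Pack the support of a vector: bit i set when v[i] != 0."""
    bits = 0
    for i, x in enumerate(v):
        if x != 0:
            bits |= 1 << i
    return bits

def binary_dot(a_bits, b_bits, n):
    """Dot product of two packed ±1 vectors of length n:
    equal bits contribute +1, differing bits -1, so
    dot = n - 2 * popcount(a XOR b)."""
    return n - 2 * bin(a_bits ^ b_bits).count("1")

def ternary_dot(w_bits, x_sign_bits, x_mask):
    """Dot of packed ±1 weights with a {-1, 0, +1} activation vector:
    zero activations are excluded via the mask before the XOR/popcount."""
    active = bin(x_mask).count("1")
    return active - 2 * bin((w_bits ^ x_sign_bits) & x_mask).count("1")
```

On a GPU the same identity maps to 32- or 64-bit words processed with XOR and the `__popc` intrinsic, which is how kernel-level speedups of this magnitude over FP16 become possible; scaling factors are applied once per output element afterward.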