OSC: Hardware Efficient W4A4 Quantization via Outlier Separation in Channel Dimension

arXiv cs.LG · April 15, 2026


Key Points

  • The paper shows that activation outliers in 4-bit LLM inference are not randomly distributed but exhibit token-persistent structural clustering: high-magnitude outliers consistently occupy the same fixed channels across tokens.
  • It proposes OSC (Outlier Separation in Channel dimension), an offline channel-detection and online dual-path inference method to suppress outliers while keeping most computation in low precision.
  • OSC performs 4-bit GEMM for the main path and a 16-bit branch for identified outlier channels, using structured sub-tensor extraction to gather sparse outlier channels into a compact dense tensor for efficient high-throughput GEMM.
  • For cases where outlier clustering is weaker (notably for W2 inputs), OSC includes a fallback to FP8 to maintain accuracy.
  • Experiments on Qwen3-8B and Qwen3-30B show average accuracy drops of only 2.19 and 1.12 points, respectively, along with a hardware-friendly peak speedup of 1.78x over a W8A8 GEMM baseline on a modern AI accelerator.
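The dual-path idea in the key points can be sketched in a few lines of numpy. This is a simulation, not the paper's kernel: the quantizer, scaling scheme, and channel-gathering details here are illustrative assumptions. The main path zeroes out the pre-identified outlier channels and runs a simulated 4-bit GEMM; the branch gathers those sparse channels into a compact dense tensor and runs a small higher-precision GEMM; the two results are summed.

```python
import numpy as np

def quantize_int4(x):
    """Symmetric per-tensor quantization to 4-bit integers in [-8, 7].
    (Illustrative; the paper targets micro-scaling 4-bit formats.)"""
    amax = np.max(np.abs(x))
    scale = amax / 7.0 if amax > 0 else 1.0
    q = np.clip(np.round(x / scale), -8, 7)
    return q, scale

def dual_path_gemm(acts, weights, outlier_cols):
    """Sketch of OSC-style dual-path computation:
    low-bit main GEMM plus a high-precision branch GEMM
    over pre-identified (offline-detected) outlier channels."""
    mask = np.zeros(acts.shape[1], dtype=bool)
    mask[outlier_cols] = True

    # Main path: suppress outlier channels, then run the 4-bit GEMM.
    main_acts = np.where(mask, 0.0, acts)
    qa, sa = quantize_int4(main_acts)
    qw, sw = quantize_int4(weights)
    main = (qa @ qw) * (sa * sw)  # dequantized low-bit product

    # Branch path: structured sub-tensor extraction gathers the
    # scattered outlier channels into a compact dense tensor, so the
    # branch is itself a regular (small) high-throughput GEMM.
    branch_acts = acts[:, mask].astype(np.float16)
    branch_w = weights[mask, :].astype(np.float16)
    branch = branch_acts.astype(np.float32) @ branch_w.astype(np.float32)

    return main + branch
```

Because the outlier channels carry most of the dynamic range, removing them before quantization lets the 4-bit path use a much tighter scale, while the compact branch keeps their contribution exact to 16-bit precision.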

Abstract

While 4-bit quantization is essential for high-throughput deployment of Large Language Models, activation outliers often cause significant accuracy degradation due to the restricted dynamic range of low-bit formats. In this paper, we systematically investigate the spatial distribution of outliers and demonstrate a token-persistent structural clustering effect, in which high-magnitude outliers consistently occupy fixed channels across tokens. Building on this insight, we propose OSC, a hardware-efficient framework for outlier suppression. During inference, OSC executes a dual-path computation consisting of a low-precision 4-bit General Matrix Multiplication (GEMM) path and a high-precision 16-bit branch GEMM path. Specifically, OSC uses an offline group-wise strategy to identify the channels where outliers are located, then performs structured sub-tensor extraction online to coalesce these scattered activation channels into a compact dense tensor. This mechanism implements outlier protection through regular, high-throughput GEMM operations, fitting seamlessly with modern 4-bit micro-scaling hardware. Furthermore, for W2 inputs, where outlier clustering is less pronounced, we integrate an FP8 fallback strategy. Evaluation on Qwen3-8B and Qwen3-30B shows that OSC restricts the average accuracy drop to 2.19 and 1.12 points, respectively. Notably, OSC is highly hardware-friendly, achieving a peak speedup of 1.78x over a W8A8 GEMM baseline on a modern AI accelerator.
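The abstract's "offline group-wise strategy" for finding outlier channels can be illustrated with a minimal sketch. The grouping, the peak-over-median statistic, and the `ratio` threshold below are assumptions for illustration, not the paper's exact criterion: the idea is simply that, on calibration data, a channel whose peak magnitude dominates its neighbors within a group is marked as an outlier channel and routed to the high-precision branch at inference time.

```python
import numpy as np

def detect_outlier_channels(calib_acts, group_size=4, ratio=8.0):
    """Illustrative offline group-wise detection (hypothetical criterion):
    within each channel group, flag channels whose per-channel peak
    magnitude over the calibration set exceeds `ratio` times the
    group's median peak magnitude."""
    peak = np.max(np.abs(calib_acts), axis=0)  # (num_channels,)
    outliers = []
    for g in range(0, len(peak), group_size):
        grp = peak[g:g + group_size]
        med = np.median(grp) + 1e-8  # guard against all-zero groups
        for i, p in enumerate(grp):
            if p > ratio * med:
                outliers.append(g + i)
    return outliers
```

Since this detection runs once offline on calibration data, the online path pays no selection cost: it only gathers the fixed, precomputed channel indices, which is what makes the token-persistent clustering observation hardware-friendly.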