OSC: Hardware Efficient W4A4 Quantization via Outlier Separation in Channel Dimension

arXiv cs.LG · April 15, 2026


Key Points

  • The paper shows that activation outliers in 4-bit LLM inference are not randomly distributed but exhibit token-persistent structural clustering: high-magnitude outliers consistently occupy the same fixed channels across tokens.
  • It proposes OSC (Outlier Separation in Channel dimension), an offline channel-detection and online dual-path inference method to suppress outliers while keeping most computation in low precision.
  • OSC performs 4-bit GEMM for the main path and a 16-bit branch for identified outlier channels, using structured sub-tensor extraction to gather sparse outlier channels into a compact dense tensor for efficient high-throughput GEMM.
  • For cases where outlier clustering is weaker (notably for W2 inputs), OSC includes a fallback to FP8 to maintain accuracy.
  • Experiments on Qwen3-8B and Qwen3-30B show average accuracy drops of only 2.19 and 1.12 points, respectively, along with a hardware-friendly peak speedup of 1.78x over a W8A8 GEMM baseline on a modern AI accelerator.
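The dual-path idea in the key points can be sketched in a few lines of numpy. This is a simulation, not the paper's kernel: the quantizer, scaling scheme, and channel-gathering details here are illustrative assumptions. The main path zeroes out the pre-identified outlier channels and runs a simulated 4-bit GEMM; the branch gathers those sparse channels into a compact dense tensor and runs a small higher-precision GEMM; the two results are summed.

```python
import numpy as np

def quantize_int4(x):
    """Symmetric per-tensor quantization to 4-bit integers in [-8, 7].
    (Illustrative; the paper targets micro-scaling 4-bit formats.)"""
    amax = np.max(np.abs(x))
    scale = amax / 7.0 if amax > 0 else 1.0
    q = np.clip(np.round(x / scale), -8, 7)
    return q, scale

def dual_path_gemm(acts, weights, outlier_cols):
    """Sketch of OSC-style dual-path computation:
    low-bit main GEMM plus a high-precision branch GEMM
    over pre-identified (offline-detected) outlier channels."""
    mask = np.zeros(acts.shape[1], dtype=bool)
    mask[outlier_cols] = True

    # Main path: suppress outlier channels, then run the 4-bit GEMM.
    main_acts = np.where(mask, 0.0, acts)
    qa, sa = quantize_int4(main_acts)
    qw, sw = quantize_int4(weights)
    main = (qa @ qw) * (sa * sw)  # dequantized low-bit product

    # Branch path: structured sub-tensor extraction gathers the
    # scattered outlier channels into a compact dense tensor, so the
    # branch is itself a regular (small) high-throughput GEMM.
    branch_acts = acts[:, mask].astype(np.float16)
    branch_w = weights[mask, :].astype(np.float16)
    branch = branch_acts.astype(np.float32) @ branch_w.astype(np.float32)

    return main + branch
```

Because the outlier channels carry most of the dynamic range, removing them before quantization lets the 4-bit path use a much tighter scale, while the compact branch keeps their contribution exact to 16-bit precision.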

Abstract

While 4-bit quantization is essential for high-throughput deployment of Large Language Models, activation outliers often cause significant accuracy degradation due to the restricted dynamic range of low-bit formats. In this paper, we systematically investigate the spatial distribution of outliers and demonstrate a token-persistent structural clustering effect, in which high-magnitude outliers consistently occupy fixed channels across tokens. Building on this insight, we propose OSC, a hardware-efficient framework for outlier suppression. During inference, OSC executes a dual-path computation consisting of a low-precision 4-bit General Matrix Multiplication (GEMM) path and a high-precision 16-bit branch GEMM path. Specifically, OSC uses an offline group-wise strategy to identify the channels where outliers are located, then performs structured sub-tensor extraction online to coalesce these scattered activation channels into a compact dense tensor. This mechanism implements outlier protection through regular, high-throughput GEMM operations, fitting seamlessly with modern 4-bit micro-scaling hardware. Furthermore, for W2 inputs, where outlier clustering is less pronounced, we integrate an FP8 fallback strategy. Evaluation on Qwen3-8B and Qwen3-30B shows that OSC restricts the average accuracy drop to 2.19 and 1.12 points, respectively. Notably, OSC is highly hardware-friendly, achieving a peak speedup of 1.78x over a W8A8 GEMM baseline on a modern AI accelerator.
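The abstract's "offline group-wise strategy" for finding outlier channels can be illustrated with a minimal sketch. The grouping, the peak-over-median statistic, and the `ratio` threshold below are assumptions for illustration, not the paper's exact criterion: the idea is simply that, on calibration data, a channel whose peak magnitude dominates its neighbors within a group is marked as an outlier channel and routed to the high-precision branch at inference time.

```python
import numpy as np

def detect_outlier_channels(calib_acts, group_size=4, ratio=8.0):
    """Illustrative offline group-wise detection (hypothetical criterion):
    within each channel group, flag channels whose per-channel peak
    magnitude over the calibration set exceeds `ratio` times the
    group's median peak magnitude."""
    peak = np.max(np.abs(calib_acts), axis=0)  # (num_channels,)
    outliers = []
    for g in range(0, len(peak), group_size):
        grp = peak[g:g + group_size]
        med = np.median(grp) + 1e-8  # guard against all-zero groups
        for i, p in enumerate(grp):
            if p > ratio * med:
                outliers.append(g + i)
    return outliers
```

Since this detection runs once offline on calibration data, the online path pays no selection cost: it only gathers the fixed, precomputed channel indices, which is what makes the token-persistent clustering observation hardware-friendly.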