OSC: Hardware Efficient W4A4 Quantization via Outlier Separation in Channel Dimension
arXiv cs.LG / 4/15/2026
Key Points
- The paper shows that activation outliers in 4-bit LLM inference are not randomly distributed: they cluster persistently, occupying the same fixed channels across tokens.
- It proposes OSC (Outlier Separation in Channel dimension), an offline channel-detection and online dual-path inference method to suppress outliers while keeping most computation in low precision.
- OSC runs the main path as a 4-bit GEMM and routes the identified outlier channels through a 16-bit branch, using structured sub-tensor extraction to gather the sparse outlier channels into a compact dense tensor suitable for high-throughput GEMM.
- Where outlier clustering is weaker (notably for W2 inputs), OSC falls back to FP8 to preserve accuracy.
- Experiments on Qwen3-8B and Qwen3-30B show small average accuracy drops (2.19 and 1.12 points, respectively) and a peak speedup of 1.78x over a W8A8 GEMM baseline on modern accelerators.
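The dual-path idea in the key points can be sketched numerically. The snippet below is a minimal NumPy emulation, not the paper's kernel: all function names are hypothetical, INT4 arithmetic is simulated in floating point, and the outlier branch uses full precision in place of the 16-bit GEMM. It illustrates the offline channel detection plus online split between a low-precision main path and a dense gathered outlier path.

```python
import numpy as np

def detect_outlier_channels(acts, k=8):
    """Offline step (hypothetical): pick the k channels with the largest
    peak |activation|, exploiting the observation that outliers persist
    in fixed channels across tokens."""
    scores = np.abs(acts).max(axis=0)          # per-channel peak magnitude
    return np.sort(np.argsort(scores)[-k:])    # sorted channel indices

def quantize_int4(x):
    """Symmetric per-tensor 4-bit quantization, emulated in float."""
    m = np.abs(x).max()
    scale = m / 7.0 if m > 0 else 1.0
    q = np.clip(np.round(x / scale), -8, 7)
    return q, scale

def osc_matmul(x, w, outlier_idx):
    """Online dual path: 4-bit GEMM for regular channels, higher-precision
    GEMM for the gathered outlier channels, then sum the partial results."""
    mask = np.zeros(x.shape[1], dtype=bool)
    mask[outlier_idx] = True
    # Main path: zero out outlier channels, quantize, low-precision GEMM.
    xq, sx = quantize_int4(np.where(mask, 0.0, x))
    wq, sw = quantize_int4(w)
    main = (xq @ wq) * (sx * sw)
    # Outlier path: gather sparse channels into compact dense sub-tensors,
    # so the branch is itself a small dense GEMM rather than sparse ops.
    x_out = x[:, outlier_idx]                  # (tokens, k)
    w_out = w[outlier_idx, :]                  # (k, out_features)
    return main + x_out @ w_out

# Toy data with two injected persistent outlier channels.
rng = np.random.default_rng(0)
x = rng.normal(size=(16, 64))
x[:, [3, 17]] *= 50.0
w = rng.normal(size=(64, 32))

idx = detect_outlier_channels(x, k=2)
y = osc_matmul(x, w, idx)

# Baseline: naive whole-tensor INT4 quantization for comparison.
xq_n, sxn = quantize_int4(x)
wq_n, swn = quantize_int4(w)
naive = (xq_n @ wq_n) * (sxn * swn)

ref = x @ w
err = np.abs(y - ref).max()
err_naive = np.abs(naive - ref).max()
```

Because the outlier channels dominate the per-tensor scale, naive INT4 quantization wastes its 4-bit range on them; separating those channels into their own dense branch keeps the main-path scale small, so `err` comes out well below `err_naive` in this toy setup.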