AXELRAM: Quantize Once, Never Dequantize

arXiv cs.LG / 4/6/2026


Key Points

  • AXELRAM is proposed as a smart SRAM macro architecture that computes attention scores directly from quantized KV cache indices, avoiding KV dequantization via a fixed, design-time codebook based on orthogonal transforms.
  • The approach uses an asymmetric write/read path—transforming on write and then using table lookup on read—reportedly cutting per-query multiplications by 102.4×.
  • Experiments across 10 random seeds and three models show mixed stability: some models (e.g., Qwen2.5-3B) can exhibit catastrophic perplexity spikes (Δ > 50), indicating strong sign-pattern sensitivity in quantized KV caches.
  • The authors attribute the failures to layer-wise norm heterogeneity and introduce a gradient-free, one-time sign pattern selection using a small calibration set (200 candidates, 8 samples) that prevents catastrophic spikes without adding hardware cost.
  • The paper is posted on arXiv with code released publicly at https://github.com/Axelidea/AXELRAM, enabling replication and further evaluation.
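The asymmetric write/read path in the points above can be sketched in a few lines. The following is a minimal NumPy illustration, not the paper's SRAM macro: a uniform codebook over ±3σ stands in for the true (d, b)-dependent optimal quantizer, and the sizes are arbitrary. The key idea survives the simplification: keys are stored only as codebook indices, and a per-query lookup table of q·c products replaces per-key dequantize-and-multiply.

```python
import numpy as np

rng = np.random.default_rng(0)

d, b = 64, 4                 # head dimension, bits per coordinate (illustrative)
levels = 2 ** b
sigma = 1.0 / np.sqrt(d)     # coordinates concentrate to N(0, 1/d) after the transform

# Design-time fixed codebook: depends only on (d, b), never on the data.
# A uniform grid over +/- 3 sigma is a stand-in for the optimal quantizer.
codebook = np.linspace(-3 * sigma, 3 * sigma, levels)

def quantize(K):
    """Write path: map each coordinate to its nearest codebook index."""
    return np.abs(K[..., None] - codebook).argmin(axis=-1)

def scores_via_lut(q, K_idx):
    """Read path: d * levels multiplications once per query, then gather-add.

    No dequantization: scores come straight from the stored indices.
    """
    lut = np.outer(q, codebook)                 # lut[j, c] = q[j] * codebook[c]
    return lut[np.arange(d), K_idx].sum(axis=-1)

# Sanity check against explicit dequantize-then-dot.
K = rng.normal(0.0, sigma, size=(8, d))         # 8 cached key vectors
q = rng.normal(0.0, 1.0, size=d)
K_idx = quantize(K)
assert np.allclose(scores_via_lut(q, K_idx), codebook[K_idx] @ q)
```

With n cached keys, the exact dot products cost n·d multiplications per query, while the lookup path costs only d·levels for the table; for long contexts the ratio grows linearly in n, which is the kind of arithmetic-identity saving the 102.4× figure refers to.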

Abstract

We propose AXELRAM, a smart SRAM macro architecture that computes attention scores directly from quantized KV cache indices without dequantization. The key enabler is a design-time fixed codebook: orthogonal-transform-based quantization concentrates each coordinate's distribution to N(0, 1/d), so the optimal quantizer depends only on dimension d and bit-width b, not on input data. The asymmetric path design (transform on write, table lookup on read with no inverse transform) reduces per-query multiplications by 102.4× (a mathematical identity). Through multi-seed evaluation (10 seeds × 3 models), we discover that sign pattern sensitivity causes catastrophic PPL spikes (Δ > 50) on certain models (Qwen2.5-3B), while others (LLaMA-3.1-8B) are fully stable. This phenomenon extends SpinQuant's observation of rotation variance in weight quantization to the KV cache domain, where the effect is qualitatively more severe. We trace the root cause to layer-wise norm heterogeneity and propose a gradient-free sign pattern selection (200 candidates, 8 calibration samples, one-time) that eliminates catastrophic spikes with zero additional hardware cost. All source code is available at https://github.com/Axelidea/AXELRAM.
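The gradient-free sign pattern selection described in the abstract can be sketched as a one-time calibration search. This is an illustrative reconstruction under stated assumptions: the abstract does not give the selection criterion, so quantization MSE on random stand-in calibration data is used here as a proxy (the paper presumably scores candidates on real model activations), and the orthogonal transform is assumed to be a Hadamard matrix composed with a diagonal ±1 sign pattern, as in SpinQuant-style schemes.

```python
import numpy as np

rng = np.random.default_rng(0)

d, b = 64, 4
levels = 2 ** b
sigma = 1.0 / np.sqrt(d)
codebook = np.linspace(-3 * sigma, 3 * sigma, levels)  # fixed design-time codebook

def quant_error(X):
    """Round-to-nearest against the fixed codebook; return MSE."""
    idx = np.abs(X[..., None] - codebook).argmin(axis=-1)
    return np.mean((codebook[idx] - X) ** 2)

def hadamard(n):
    """Orthonormal Hadamard matrix via Sylvester construction (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

H = hadamard(d)

# 8 calibration samples: random unit-norm rows stand in for real KV activations,
# so post-transform coordinates have variance ~1/d as the codebook expects.
calib = rng.normal(0.0, 1.0, size=(8, 64, d))
calib /= np.linalg.norm(calib, axis=-1, keepdims=True)

# One-time, gradient-free search over 200 candidate sign patterns.
best_s, best_err = None, np.inf
for _ in range(200):
    s = rng.choice([-1.0, 1.0], size=d)     # candidate diagonal sign flip
    T = H * s                               # H @ diag(s), still orthogonal
    err = np.mean([quant_error(x @ T) for x in calib])
    if err < best_err:
        best_s, best_err = s, err
```

The chosen `best_s` is then frozen into the write-path transform, so the search adds no hardware: it changes only which fixed orthogonal transform is wired in, which is why the fix is free at inference time.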