AXELRAM: Quantize Once, Never Dequantize
arXiv cs.LG / 4/6/2026
Key Points
- AXELRAM is proposed as a smart SRAM macro architecture that computes attention scores directly from quantized KV cache indices, avoiding KV dequantization via a fixed, design-time codebook based on orthogonal transforms.
- The approach uses an asymmetric write/read path—transforming on write and then using table lookup on read—reportedly cutting per-query multiplications by 102.4×.
- Experiments across 10 random seeds and three models show mixed stability: some models (e.g., Qwen2.5-3B) can exhibit catastrophic perplexity spikes (Δ > 50), indicating strong sign-pattern sensitivity in quantized KV caches.
- The authors attribute the failures to layer-wise norm heterogeneity and introduce a gradient-free, one-time sign pattern selection using a small calibration set (200 candidates, 8 samples) that prevents catastrophic spikes without adding hardware cost.
- The paper is posted on arXiv, with code released publicly on GitHub, enabling replication and further evaluation.
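The core lookup idea can be sketched in a few lines: because the codebook is fixed at design time, a query needs only one matvec against the codebook, after which every key's attention score is a table lookup on its stored index — no dequantization and no per-key multiplies. This is a minimal NumPy sketch, not the authors' hardware design; the random orthogonal codebook and nearest-neighbor write quantizer are assumptions standing in for the paper's orthogonal-transform construction.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 8          # head dimension (toy size)
n_codes = 16   # codebook size, e.g. 4-bit indices

# Design-time: a fixed codebook derived from an orthogonal basis.
# (A random orthogonal transform is used here as an assumption.)
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
codebook = rng.standard_normal((n_codes, d)) @ Q

def write_key(k):
    """Write path: transform-and-quantize a key to its nearest codebook index."""
    return int(np.argmin(np.linalg.norm(codebook - k, axis=1)))

def read_scores(q, key_indices):
    """Read path: one q-codebook matvec, then pure table lookup per key."""
    lut = codebook @ q        # single matvec per query
    return lut[key_indices]   # no dequantization, no per-key multiplications

keys = rng.standard_normal((5, d))
idx = np.array([write_key(k) for k in keys])
qv = rng.standard_normal(d)
scores = read_scores(qv, idx)
```

The asymmetry is the point: the expensive transform happens once on write, while the read path, which dominates decoding, degenerates to indexing into a precomputed table.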
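The sign-pattern fix is likewise simple to picture: since the search is gradient-free and one-time, it amounts to scoring each candidate pattern on a tiny calibration set and keeping the best. The sketch below uses the paper's reported budget (200 candidates, 8 samples), but the error criterion and the way a sign pattern is applied are assumptions for illustration, not the authors' exact procedure.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_codes = 8, 16
calib = rng.standard_normal((8, d))              # 8 calibration samples
codebook = rng.standard_normal((n_codes, d))     # hypothetical fixed codebook

def quant_error(x, cb):
    """Mean round-trip error when each row of x maps to its nearest code."""
    idx = np.argmin(np.linalg.norm(cb[None, :, :] - x[:, None, :], axis=2), axis=1)
    return float(np.mean(np.linalg.norm(x - cb[idx], axis=1)))

# One-time, gradient-free selection: evaluate each candidate +/-1 sign
# pattern on the calibration set and keep the lowest-error pattern.
candidates = rng.choice([-1.0, 1.0], size=(200, d))
errs = [quant_error(calib * s, codebook) for s in candidates]
best_sign = candidates[int(np.argmin(errs))]
```

Because the chosen pattern is fixed before deployment and only flips signs, it costs nothing in hardware, which matches the paper's claim that the fix prevents the catastrophic perplexity spikes without added area or latency.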