Optimal Scalar Quantization for Matrix Multiplication: Closed-Form Density and Phase Transition

arXiv cs.AI / 3/23/2026

Key Points

  • The authors study entrywise scalar quantization of A and B before matrix multiplication. Under a pair-i.i.d. inner-product model, they derive a sharp K^{-2} asymptotic expansion of the matrix multiplication MSE, with exact optimal leading constants, in the high-resolution regime (K_X, K_Y → ∞).
  • For correlated Gaussian pairs, they derive a closed-form optimal center density λ*(u) ∝ exp(-u^2/6) ((1-ρ^2)+ρ^2 u^2)^{1/3}, where u = x/σ_X, with the same form for y/σ_Y.
  • They identify a correlation-driven phase transition: the density is unimodal at the origin for |ρ| ≤ 1/√3 and becomes bimodal for |ρ| > 1/√3 with peaks at u_peak = ±√(3-1/ρ^2).
  • The paper demonstrates applicability in synthetic experiments on matrix multiplication quantization and least squares optimization, as well as quantization of large language model key and query activations.
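  • These results offer practical guidance for designing scalar quantizers that minimize the error of the product AB (the matrix multiplication MSE) rather than treating each matrix in isolation, which may help low-precision ML deployments.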
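
The phase-transition threshold and peak locations in the key points can be checked directly from the stated density. The short derivation below is a reconstruction from that formula, not a quotation from the paper, and assumes |ρ| < 1 so that the bracketed term is strictly positive.

```latex
\begin{align*}
\log \lambda^\star(u)
  &= -\frac{u^2}{6} + \frac{1}{3}\log\bigl((1-\rho^2)+\rho^2u^2\bigr) + \text{const},\\[4pt]
\frac{d}{du}\log \lambda^\star(u)
  &= -\frac{u}{3} + \frac{2\rho^2 u}{3\bigl((1-\rho^2)+\rho^2u^2\bigr)}
   = \frac{u\,\bigl(3\rho^2 - 1 - \rho^2u^2\bigr)}{3\bigl((1-\rho^2)+\rho^2u^2\bigr)}.
\end{align*}
% Critical points: u = 0, or rho^2 u^2 = 3 rho^2 - 1, i.e. u^2 = 3 - 1/rho^2.
% If rho^2 <= 1/3, the factor (3 rho^2 - 1 - rho^2 u^2) is negative for all u != 0,
% so the derivative has the sign of -u and the only maximum is at u = 0 (unimodal).
% If rho^2 > 1/3, the derivative is positive on (0, u_peak) and negative beyond,
% so u = 0 is a local minimum and the maxima sit at u_peak = \pm\sqrt{3 - 1/\rho^2}.
```

As a quick sanity check, ρ = 0.9 gives u_peak = ±√(3 − 1/0.81) ≈ ±1.33, while ρ = 0.5 falls below the threshold 1/√3 ≈ 0.577 and keeps a single peak at the origin.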

Abstract

We study entrywise scalar quantization of two matrices prior to multiplication. Given A\in R^{m\times k} and B\in R^{k\times n}, we quantize entries of A and B independently using scalar quantizers with K_X and K_Y levels per entry, and form \widehat C=\widehat A\,\widehat B. The objective is to minimize the matrix multiplication mean-squared error (MSE) \mathcal{E}=E[\|AB-\widehat A\widehat B\|_F^2] under a pair-i.i.d. inner-product model. In the high-resolution regime K_X,K_Y\to\infty, we derive a sharp K^{-2} asymptotic expansion for \mathcal{E}, identify the exact optimal leading constants, and characterize asymptotically optimal quantization center densities in terms of conditional second moments. We then specialize to correlated Gaussian multiplicative pairs, obtaining a closed-form optimal point density

\[ \lambda^\star(u)\ \propto\ \exp\!\left(-\frac{u^2}{6}\right)\bigl((1-\rho^2)+\rho^2u^2\bigr)^{1/3}, \qquad u=\frac{x}{\sigma_X}, \]

with the same form for y/\sigma_Y, and prove a correlation-driven phase transition: the density is unimodal at the origin for |\rho|\leq 1/\sqrt{3} and becomes bimodal for |\rho|>1/\sqrt{3} with peaks at u_{\mathrm{peak}}=\pm\sqrt{3-1/\rho^2}. We show our method's applicability in synthetic experiments such as matrix multiplication quantization and least squares optimization, as well as quantization of large language model key and query activations.
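
To make the closed-form density concrete, here is a minimal Python sketch that evaluates λ*(u), returns the peak locations stated in the abstract, and turns the density into K quantization centers. The function names, the grid parameters `u_max` and `grid_size`, and the rule of placing centers at the (i + 0.5)/K quantiles of the normalized density are assumptions made for illustration (a standard way to realize a point density), not the authors' implementation.

```python
import numpy as np


def optimal_density(u, rho):
    """Unnormalized lambda*(u) from the abstract, with u = x / sigma_X."""
    u = np.asarray(u, dtype=float)
    return np.exp(-u**2 / 6.0) * ((1.0 - rho**2) + rho**2 * u**2) ** (1.0 / 3.0)


def predicted_peaks(rho):
    """Peak locations as stated in the abstract.

    One peak at 0 for |rho| <= 1/sqrt(3); otherwise two peaks at
    +-sqrt(3 - 1/rho^2).
    """
    if rho**2 <= 1.0 / 3.0:
        return np.array([0.0])
    p = np.sqrt(3.0 - 1.0 / rho**2)
    return np.array([-p, p])


def centers_from_density(rho, K, u_max=8.0, grid_size=200_001):
    """Place K centers at the (i + 0.5)/K quantiles of the normalized density.

    Assumption: this quantile rule is an illustrative way to realize a
    point density; the paper may construct its quantizers differently.
    """
    u = np.linspace(-u_max, u_max, grid_size)
    lam = optimal_density(u, rho)
    cdf = np.cumsum(lam)          # numerical CDF on the grid
    cdf = cdf / cdf[-1]           # normalize so the CDF ends at 1
    targets = (np.arange(K) + 0.5) / K
    return np.interp(targets, cdf, u)


def quantize(x, centers):
    """Map each entry of x to its nearest center (centers must be sorted)."""
    edges = (centers[:-1] + centers[1:]) / 2.0   # midpoints between centers
    return centers[np.searchsorted(edges, x)]


if __name__ == "__main__":
    rho, K = 0.9, 16
    centers = centers_from_density(rho, K)
    print("predicted peaks:", predicted_peaks(rho))
    print("centers (in units of sigma_X):", np.round(centers, 3))
```

The centers are expressed in normalized units, so an entry x would be quantized as σ_X · quantize(x / σ_X, centers). For |ρ| above 1/√3 the centers cluster around the two peaks rather than around zero, which is the bimodal behavior the abstract describes.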