Efficient VQ-QAT and Mixed Vector/Linear Quantized Neural Networks

arXiv cs.LG / April 28, 2026


Key Points

  • The paper proposes three vector-quantization (VQ) techniques to compress neural network weights while keeping training feasible end-to-end.
  • To mitigate codebook collapse and keep training end-to-end, it replaces the usual distance-based codeword assignment with cosine-similarity-based assignment, paired with top-1 sampling and a straight-through estimator (see the sketch after this list).
  • Building on the attention-like soft-assignment formulation of Differentiable K-Means (DKM), the hard top-1 selection with a straight-through estimator removes the need for DKM's weighted-average reconstruction.
  • It also explores differentiable neural architecture search (NAS) to automatically select per-layer quantization configurations, aiming to improve compression quality (a sketch follows the abstract).
  • Results show the approach does not consistently beat existing methods at every quantization level, but the experiments clarify key design trade-offs and behaviors of VQ-based compression.
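Below is a minimal PyTorch sketch of the assignment mechanism described in the key points, assuming weights are grouped into sub-vectors and matched against a learnable codebook. The function name, shapes, and temperature are illustrative, and "top-1 sampling" is rendered as a plain argmax; the paper's exact sampling strategy may differ.

```python
import torch
import torch.nn.functional as F

def cosine_vq_ste(weights: torch.Tensor, codebook: torch.Tensor,
                  tau: float = 1.0) -> torch.Tensor:
    """Quantize (n, d) weight sub-vectors against a (k, d) codebook.

    Scores are cosine similarities; the forward pass emits the top-1
    codeword exactly (no weighted average), while the backward pass routes
    gradients through the soft assignment (straight-through estimator).
    """
    sim = F.normalize(weights, dim=-1) @ F.normalize(codebook, dim=-1).T  # (n, k)
    soft = F.softmax(sim / tau, dim=-1)        # differentiable assignment
    # DKM-style reconstruction would be `soft @ codebook` (weighted average);
    # here the assignment is hardened to the single best codeword instead.
    hard = F.one_hot(soft.argmax(dim=-1), codebook.shape[0]).to(soft.dtype)
    assign = hard + soft - soft.detach()       # value: hard, gradient: soft
    return assign @ codebook                   # (n, d): exact codewords
```

Because the forward pass returns exact codewords, only codeword indices and the codebook need to be stored after training, while both the weights and the selected codewords still receive gradients during quantization-aware training.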

Abstract

In this work, we developed and tested three techniques for vector quantization (VQ)-based model weight compression. To mitigate codebook collapse and enable end-to-end training, we adopted cosine-similarity-based assignment. Building on ideas from attention-based formulations in Differentiable K-Means (DKM), we further improved this approach by using cosine similarity for assignment combined with top-1 sampling and a straight-through estimator, thereby eliminating the need for weighted-average reconstruction. Finally, we investigated the use of differentiable neural architecture search (NAS) to adaptively select layer-wise quantization configurations, further optimizing the compression process. Although our method does not consistently outperform existing approaches across all quantization levels, it provides useful insights into the design trade-offs and behaviors of VQ-based model compression methods.
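To make the NAS component concrete, here is a hedged DARTS-style sketch of a differentiable per-layer choice among candidate quantizers. The class, the candidate set, and the uniform ("linear") quantizer are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def uniform_quant_ste(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Symmetric uniform ('linear') quantization with a straight-through estimator."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    q = torch.round(w / scale).clamp(-qmax - 1, qmax) * scale
    return w + (q - w).detach()

class MixedQuantLinear(nn.Module):
    """A linear layer whose weight quantizer is chosen by differentiable search.

    The discrete per-layer choice is relaxed into a softmax over architecture
    logits, in the spirit of DARTS; after search, the argmax candidate is kept.
    """
    def __init__(self, in_features: int, out_features: int, quantizers):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.quantizers = quantizers                              # candidate transforms
        self.alpha = nn.Parameter(torch.zeros(len(quantizers)))  # architecture logits

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        coeffs = F.softmax(self.alpha, dim=0)
        # Softmax-weighted mixture of the layer output under each candidate quantizer.
        return sum(c * F.linear(x, q(self.linear.weight), self.linear.bias)
                   for c, q in zip(coeffs, self.quantizers))

# Illustrative candidates: full precision plus 4-bit and 2-bit uniform; a VQ
# quantizer (e.g. the cosine_vq_ste sketch above) could be added to mix vector
# and linear quantization per layer.
layer = MixedQuantLinear(64, 64, [
    lambda w: w,
    lambda w: uniform_quant_ste(w, bits=4),
    lambda w: uniform_quant_ste(w, bits=2),
])
```

After the search phase, one would keep only the candidate with the largest logit in each layer and fine-tune under that fixed configuration.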