AI Navigate

Early Quantization Shrinks Codebook: A Simple Fix for Diversity-Preserving Tokenization

arXiv cs.LG / 3/19/2026


Key Points

  • The paper investigates representation collapse in the vector quantization used to tokenize data for generative models, identifying collapse in both discrete codebook tokens and continuous latent embeddings.
  • It uses both synthetic and real datasets to quantify the severity of each type of collapse and to establish triggering conditions.
  • The study finds that random initialization and limited encoder capacity contribute to token and embedding collapses, respectively.
  • The authors propose mitigation strategies targeted at each type of collapse and present this as the first comprehensive analysis of representation collapse in vector quantization.
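One symptom of the token collapse described above is that only a small fraction of codebook entries are ever selected. The following sketch is illustrative only (not the paper's code or data): it builds a hypothetical scenario where encoder outputs are tightly clustered while the codebook is spread out, so most codes go unused.

```python
import random

def codebook_usage(embeddings, codebook):
    """Fraction of codebook entries assigned at least one embedding.
    A low fraction is one indicator of token collapse."""
    def nearest(z):
        # Nearest-neighbor assignment by squared L2 distance.
        return min(range(len(codebook)),
                   key=lambda k: sum((a - b) ** 2 for a, b in zip(z, codebook[k])))
    used = {nearest(z) for z in embeddings}
    return len(used) / len(codebook)

random.seed(0)
# Hypothetical setup: embeddings clustered near the origin, codebook on a
# wide grid, so only the code nearest the origin is ever selected.
embeddings = [[random.gauss(0, 0.1), random.gauss(0, 0.1)] for _ in range(200)]
codebook = [[float(i), float(j)] for i in range(-2, 3) for j in range(-2, 3)]  # 25 codes
print(codebook_usage(embeddings, codebook))  # low usage fraction
```

In a healthy tokenizer this fraction stays close to 1; monitoring it during training is a simple diagnostic for the collapse the paper studies.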

Abstract

Vector quantization is a technique in machine learning that discretizes continuous representations into a set of discrete vectors. It is widely employed in tokenizing data representations for large language models, diffusion models, and other generative models. Despite its prevalence, the characteristics and behaviors of vector quantization in generative models remain largely underexplored. In this study, we systematically investigate collapse in vector quantization, where collapsed representations are observed across discrete codebook tokens and continuous latent embeddings. By leveraging both synthetic and real datasets, we identify the severity of each type of collapse and its triggering conditions. Our analysis reveals that random initialization and limited encoder capacity result in token collapse and embedding collapse, respectively. Building on these findings, we propose potential solutions aimed at mitigating each collapse. To the best of our knowledge, this is the first comprehensive study examining representation collapse in vector quantization.
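The discretization step the abstract refers to can be sketched in a few lines: each continuous vector is replaced by the index of its nearest codebook entry, and downstream models consume those discrete indices. This is a minimal illustration of the general technique, not the paper's implementation; the toy codebook values are hypothetical.

```python
def quantize(z, codebook):
    """Return (index, code) of the codebook entry nearest to z (squared L2)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    idx = min(range(len(codebook)), key=lambda k: sqdist(z, codebook[k]))
    return idx, codebook[idx]

# Toy 2-D codebook with three codes (hypothetical values).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
idx, code = quantize([0.9, 1.1], codebook)
print(idx, code)  # -> 1 [1.0, 1.0]
```

In a trained tokenizer the codebook entries are learned jointly with the encoder; the collapse phenomena studied in the paper arise from how this assignment interacts with initialization and encoder capacity.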