Early Quantization Shrinks Codebook: A Simple Fix for Diversity-Preserving Tokenization
arXiv cs.LG / 3/19/2026
Key Points
- The paper investigates representation collapse in vector quantization (VQ), the tokenization step used by many generative models, and distinguishes two failure modes: token collapse in the discrete codebook and embedding collapse in the continuous latent space (see the sketch after this list).
- Experiments on both synthetic and real datasets quantify the severity of each type of collapse and establish the conditions that trigger it.
- The study attributes token collapse primarily to random codebook initialization and embedding collapse to limited encoder capacity.
- The authors propose mitigation strategies targeted at each collapse type and present the work as the first comprehensive analysis of representation collapse in vector quantization.
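
Token collapse is straightforward to diagnose in practice by tracking codebook usage. Below is a minimal sketch, not the paper's code: the `VectorQuantizer` class and the perplexity diagnostic are illustrative assumptions based on standard VQ-VAE practice. Codebook perplexity near the codebook size indicates healthy utilization, while a value near 1 means almost every input maps to a single code.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Minimal VQ layer with a codebook-usage diagnostic for token collapse."""

    def __init__(self, num_codes: int = 512, dim: int = 64):
        super().__init__()
        # Random initialization: one of the conditions the paper links to token collapse.
        self.codebook = nn.Embedding(num_codes, dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)

    def forward(self, z: torch.Tensor):
        # z: (batch, dim) continuous encoder outputs.
        dists = torch.cdist(z, self.codebook.weight)  # (batch, num_codes) distances
        indices = dists.argmin(dim=1)                 # nearest-code assignment
        z_q = self.codebook(indices)                  # quantized embeddings
        # Straight-through estimator so gradients still reach the encoder.
        z_q = z + (z_q - z).detach()
        return z_q, indices

def codebook_perplexity(indices: torch.Tensor, num_codes: int) -> float:
    """exp(entropy) of code usage: ~num_codes when healthy, ~1 under full collapse."""
    counts = torch.bincount(indices, minlength=num_codes).float()
    probs = counts / counts.sum()
    entropy = -(probs * (probs + 1e-10).log()).sum()
    return entropy.exp().item()

# Usage: a perplexity far below num_codes signals that most codes are dead.
vq = VectorQuantizer(num_codes=512, dim=64)
z = torch.randn(4096, 64)
_, idx = vq(z)
print(f"codebook perplexity: {codebook_perplexity(idx, 512):.1f} / 512")
```

Common mitigations in the VQ literature, such as k-means-based codebook initialization or periodically re-seeding unused codes from recent encoder outputs, can be layered onto this loop; the strategies the paper itself proposes may differ.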