Early Quantization Shrinks Codebook: A Simple Fix for Diversity-Preserving Tokenization
arXiv cs.LG / 3/19/2026
Key Points
- The paper investigates representation collapse in the vector quantization used to tokenize data for generative models, identifying two distinct failure modes: collapse of the discrete codebook tokens and collapse of the continuous latent embeddings.
- It uses both synthetic and real datasets to quantify the severity of each type of collapse and to establish the conditions that trigger it.
- The study finds that random initialization contributes to token collapse and that limited encoder capacity contributes to embedding collapse; a sketch of how token collapse can be measured follows this list.
- The authors propose mitigation strategies targeted at each type of collapse and describe this as the first comprehensive analysis of representation collapse in vector quantization.
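To make the token-collapse failure mode concrete, here is a minimal sketch (not the paper's code; the function names, the NumPy setup, and the deliberately wide random initialization are illustrative assumptions). It quantizes latent vectors against a codebook by nearest-neighbour assignment and reports codebook perplexity, a common utilization metric: it approaches the codebook size when all codes are used evenly and drops toward 1 when the codebook has collapsed onto a few tokens.

```python
# Minimal sketch (not from the paper): nearest-neighbour vector quantization
# plus a codebook-utilization metric that exposes token collapse, i.e. the
# situation where only a few codebook entries are ever selected.
import numpy as np

def quantize(latents: np.ndarray, codebook: np.ndarray):
    """Assign each latent vector (N, D) to its nearest codebook entry (K, D)."""
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (N, K)
    codes = dists.argmin(axis=1)                                         # (N,)
    return codebook[codes], codes

def codebook_perplexity(codes: np.ndarray, num_codes: int) -> float:
    """exp(entropy) of the empirical code distribution: close to num_codes
    when usage is uniform, close to 1 when the codebook has collapsed."""
    hist = np.bincount(codes, minlength=num_codes).astype(float)
    probs = hist / hist.sum()
    nonzero = probs[probs > 0]
    entropy = -(nonzero * np.log(nonzero)).sum()
    return float(np.exp(entropy))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    K, D, N = 64, 16, 4096
    # Assumption for illustration: a wide random initialization can leave most
    # codes far from the data, so only a handful are ever selected.
    codebook = rng.normal(scale=5.0, size=(K, D))
    latents = rng.normal(size=(N, D))
    _, codes = quantize(latents, codebook)
    print(f"codes used: {len(np.unique(codes))}/{K}, "
          f"perplexity: {codebook_perplexity(codes, K):.1f}")
```

With this deliberately mismatched initialization the script should report far fewer active codes than the codebook size, which is the kind of initialization-driven token collapse the key points describe; it is a toy illustration, not a reproduction of the paper's experiments.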