AI Navigate

Only relative ranks matter in weight-clustered large language models

arXiv cs.LG / 3/19/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper demonstrates that the relative rank of weights, not their exact magnitudes, largely determines LLM performance, enabling training-free compression by clustering weights to K shared values (16-64 per matrix) for models like Llama 3.1-8B-Instruct and SmolLM2-135M.
  • Reducing each weight matrix to 16-64 distinct values preserves accuracy without retraining, and optionally fine-tuning only the centroids recovers about 30-40% of the remaining accuracy gap at minimal cost.
  • Scrambling the cluster means—i.e., changing the rank—degrades quality sharply, while rank-preserving randomizations cause little loss in mid/late layers, highlighting rank as the critical factor.
  • When many layers are perturbed, scale drift rather than rank distortion is the dominant collapse mechanism; an affine correction w' = aw + b with a > 0 that preserves rank order and distribution can substantially delay this drift, offering a new lens on model compression and robustness.
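The clustering step described in the bullets above can be sketched as a simple 1-D k-means over a single weight matrix. This is a minimal NumPy illustration, not the paper's implementation; the function name, the quantile initialization, and the default K=16 are our assumptions:

```python
import numpy as np

def cluster_weights(W, K=16, iters=25):
    """Replace every entry of W with one of K shared values via 1-D k-means.

    Returns the quantized matrix, the per-weight cluster assignments,
    and the K centroids (the only values that remain trainable if one
    fine-tunes centroids, as the paper optionally does).
    """
    w = W.ravel()
    # Initialize centroids at evenly spaced quantiles of the weight distribution.
    centroids = np.quantile(w, np.linspace(0.0, 1.0, K))
    for _ in range(iters):
        # Assign each weight to its nearest centroid.
        assign = np.abs(w[:, None] - centroids[None, :]).argmin(axis=1)
        # Move each centroid to the mean of the weights assigned to it.
        for k in range(K):
            mask = assign == k
            if mask.any():
                centroids[k] = w[mask].mean()
    return centroids[assign].reshape(W.shape), assign, centroids
```

After this step the matrix stores only K distinct floats plus an index per weight, which is where the on-disk compression comes from.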

Abstract

Large language models (LLMs) contain billions of parameters, yet many exact values are not essential. We show that what matters most is the relative rank of weights (whether one connection is stronger or weaker than another) rather than precise magnitudes. To reduce the number of unique weight values, we apply weight clustering to pretrained models, replacing every weight matrix with K shared values from K-means. For Llama 3.1-8B-Instruct and SmolLM2-135M, reducing each matrix to only 16-64 distinct values preserves strong accuracy without retraining, providing a simple, training-free method to compress LLMs on disk. Optionally fine-tuning only the cluster means (centroids) recovers 30-40 percent of the remaining accuracy gap at minimal cost. We then systematically randomize cluster means while keeping assignments fixed. Scrambling the relative ranks of the clusters degrades quality sharply (perplexity can increase by orders of magnitude) even when global statistics such as mean and variance are preserved. In contrast, rank-preserving randomizations cause almost no loss at mid and late layers. On the other hand, when many layers are perturbed simultaneously, progressive layer-by-layer replacement reveals that scale drift, not rank distortion, is the dominant collapse mechanism; however, an affine correction w' = aw + b with a > 0 (which preserves both rank order and overall weight distribution) can substantially delay this drift. This rank-based perspective offers a new lens on model compression and robustness.
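The two perturbation families the abstract contrasts can be illustrated on a toy set of centroids. This sketch assumes nothing about the paper's code beyond the stated affine form w' = aw + b with a > 0; the function names and constants are ours:

```python
import numpy as np

def rank_preserving(centroids, a=0.9, b=0.01):
    """Affine map w' = a*w + b with a > 0: changes values but keeps rank order."""
    assert a > 0
    return a * centroids + b

def rank_scrambling(centroids, seed=0):
    """Random permutation of centroid values: keeps the value set, destroys ranks."""
    rng = np.random.default_rng(seed)
    return rng.permutation(centroids)

# Toy centroids, sorted so their rank order is just 0, 1, ..., K-1.
centroids = np.sort(np.random.default_rng(1).normal(size=16))
affine = rank_preserving(centroids)
shuffled = rank_scrambling(centroids)

# The affine map preserves relative ranks; the shuffle generally does not,
# even though both leave the set of global statistics essentially intact.
assert (np.argsort(affine) == np.argsort(centroids)).all()
```

In the paper's experiments, applying a rank-scrambling perturbation (with cluster assignments held fixed) is what blows up perplexity, while rank-preserving maps like the affine one above are nearly harmless in mid and late layers.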