Through a Compressed Lens: Investigating The Impact of Quantization on Factual Knowledge Recall

arXiv cs.CL / 4/30/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper investigates how common LLM quantization techniques affect factual knowledge recall (FKR), a key capability related to how models retrieve stored knowledge.
  • Using multiple quantization methods across different bit widths and interpretability-driven analyses, the authors find that quantization usually causes information loss that reduces FKR.
  • The negative impact is especially pronounced for smaller models within the same architectural families, though lower-bit quantized models are not always worse.
  • In some cases, quantization can even improve FKR, and the study reports that BitsAndBytes preserves FKR best relative to full-precision baselines.
  • Overall, quantization leads to modest performance degradation on FKR while still functioning as an effective model compression approach, with results varying by model and method.

Abstract

Quantization methods are widely used to accelerate inference and streamline the deployment of large language models (LLMs). Although quantization's effects on various LLM capabilities have been extensively studied, one critical area remains underexplored: factual knowledge recall (FKR), the process by which LLMs access stored knowledge. To address this gap, we conduct comprehensive experiments using three common quantization techniques at distinct bit widths, in conjunction with interpretability-driven analyses on two tasks: knowledge memorization and latent multi-hop reasoning. We show that quantization typically results in information loss within LLMs, consequently diminishing their capacity for FKR. This effect is particularly pronounced in smaller models within the same architectural families. However, models quantized at reduced bit precision do not consistently exhibit inferior performance, and quantization may occasionally even enhance FKR. We find that BitsAndBytes best preserves the original full-precision model's FKR. Despite variability across models and methods, quantization causes only modest performance degradation and remains an effective compression strategy.
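
To make the setup concrete, here is a minimal sketch of the kind of comparison the abstract describes: loading the same model in full precision and in a BitsAndBytes-quantized variant, then probing both with a knowledge-memorization style prompt. The model name, prompt, and bit width are illustrative assumptions, not the paper's actual evaluation pipeline.

```python
# Sketch: compare a full-precision model against a 4-bit (bitsandbytes NF4)
# variant on a simple factual-recall prompt. Model name and prompt are
# illustrative assumptions; the paper's experiments use its own models,
# quantization settings, and FKR benchmarks.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.1-8B"  # assumed example; any causal LM works

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Full-precision (bf16) baseline.
model_fp = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# 4-bit quantized variant via the bitsandbytes backend.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model_q4 = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

# Knowledge-memorization style probe: does the quantized model still
# complete the fact the same way the full-precision model does?
prompt = "The capital of France is"

for name, model in [("bf16", model_fp), ("4-bit NF4", model_q4)]:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=5, do_sample=False)
    print(name, "->", tokenizer.decode(out[0], skip_special_tokens=True))
```

Running such a probe across many facts (and multi-hop questions) and comparing agreement with the full-precision baseline is one simple way to quantify how much FKR a given quantization method preserves.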