GlowQ: Group-Shared LOw-Rank Approximation for Quantized LLMs

arXiv cs.LG · March 27, 2026


Key Points

  • The paper introduces GlowQ, a group-shared low-rank correction method designed to improve the accuracy of quantized LLMs at low bit-widths (e.g., 4-bit) where standard quantization methods like BitsAndBytes, AWQ, and GPTQ can degrade performance.
  • Unlike prior low-rank correction approaches, which restore every layer and insert error-correction modules into every decoder block, GlowQ caches a single shared right factor per input-sharing group and selectively restores only the groups/layers that provide the largest accuracy gains.
  • GlowQ computes an expensive high-precision projection once per input-sharing group and reuses it across that group's modules, reducing parameter/memory overhead while preserving layer-specific expressivity (see the sketch after this list).
  • The selective variant GlowQ-S applies the cached shared module only to the locations with the highest benefit, achieving larger performance gains while keeping downstream accuracy nearly unchanged.
  • Reported results show GlowQ reduces TTFB by 5.6% and increases throughput by 9.6% on average, while GlowQ-S further cuts TTFB by 23.4% and boosts throughput by 37.4% with minimal accuracy loss (within ~0.2 percentage points on average).
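
To make the grouping mechanism concrete, here is a minimal PyTorch sketch. It is an illustrative reconstruction under assumptions, not the paper's implementation: the SVD-based fit, the function names, and the toy rounding quantizer are all invented here, and GlowQ's actual factorization and calibration procedure may differ.

```python
import torch

def fit_group_shared_correction(weights, qweights, rank):
    """Fit one shared right factor B for an input-sharing group (e.g., the
    q/k/v projections that consume the same hidden state) plus a per-module
    left factor A_i, so that W_i ~= Q(W_i) + A_i @ B."""
    # Quantization errors E_i = W_i - Q(W_i) for every module in the group.
    errors = [w - qw for w, qw in zip(weights, qweights)]
    # Stack errors along the output dimension so the top right singular
    # vectors span a single input-side subspace shared by the whole group.
    stacked = torch.cat(errors, dim=0)                 # (sum(out_i), in)
    _, _, Vh = torch.linalg.svd(stacked, full_matrices=False)
    B = Vh[:rank, :]                                   # shared right factor (r, in)
    # Least-squares per-module left factors; B has orthonormal rows, so the
    # optimal fit is simply the projection E_i @ B^T.
    As = [e @ B.T for e in errors]                     # each (out_i, r)
    return As, B

def corrected_forward(x, qweights, As, B):
    """Inference: project the input onto the shared subspace once per group
    and reuse the cached projection across all of the group's modules."""
    z = x @ B.T                                        # computed once, shared
    return [x @ qw.T + z @ a.T for qw, a in zip(qweights, As)]

# Toy usage: three projections sharing one input, with a crude quantizer
# standing in for real 4-bit quantization.
torch.manual_seed(0)
W  = [torch.randn(64, 128) for _ in range(3)]
Wq = [torch.round(w * 8) / 8 for w in W]
As, B = fit_group_shared_correction(W, Wq, rank=16)
ys = corrected_forward(torch.randn(4, 128), Wq, As, B)
```

The shared `z = x @ B.T` is the cached right factor at work: the expensive projection is amortized across the whole input-sharing group, while each module keeps its own small `A_i` for layer-specific expressivity.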

Abstract

Quantization techniques such as BitsAndBytes, AWQ, and GPTQ are widely used as standard methods for deploying large language models, but they often degrade accuracy at low-bit representations, e.g., 4 bits. Low-rank correction methods (e.g., LQER, QERA, ASER) have been proposed to mitigate this issue; however, they restore all layers and insert error-correction modules into every decoder block, which increases latency and memory overhead. To address this limitation, we propose GlowQ, a group-shared low-rank approximation for quantized LLMs that caches a single shared right factor per input-sharing group and restores only the groups or layers that yield the highest accuracy benefit. GlowQ computes the high-precision projection once per input-sharing group and reuses it across the group's modules, reducing parameter and memory overhead while retaining the expressivity of layer-specific corrections. We also propose a selective variant, GlowQ-S, that applies the cached shared module only where it provides the largest benefit. Compared with strong baselines, our approach reduces TTFB by 5.6% and increases throughput by 9.6% on average, while reducing perplexity on WikiText-2 by 0.17% and increasing downstream accuracy by 0.42 percentage points. The selective variant GlowQ-S further reduces latency, cutting TTFB by 23.4% and increasing throughput by 37.4%, while maintaining accuracy within 0.2 percentage points on average.
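
The selective variant admits an equally small sketch. The selection rule below is an assumption for illustration, not the paper's criterion: it ranks modules by how much quantization error the shared correction actually removes and keeps only a fixed budget of them, which is the kind of mechanism that would let GlowQ-S skip the correction entirely at the remaining locations and recover the reported latency and throughput gains.

```python
import torch

def select_corrections(errors, As, B, budget):
    """Hypothetical GlowQ-S-style selection: keep the correction only for
    the `budget` modules where it removes the most error (assumed rule,
    not taken from the paper)."""
    scores = []
    for i, (e, a) in enumerate(zip(errors, As)):
        residual = e - a @ B                  # error left after correction
        # Benefit score: Frobenius-norm error removed by A_i @ B.
        removed = (torch.linalg.norm(e) - torch.linalg.norm(residual)).item()
        scores.append((removed, i))
    # Modules outside `keep` run with plain quantized weights and pay no
    # correction cost at inference time.
    keep = {i for _, i in sorted(scores, reverse=True)[:budget]}
    return keep
```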