The Curse and Blessing of Mean Bias in FP4-Quantized LLM Training
arXiv cs.LG / 3/12/2026
Key Points
- The paper identifies a coherent rank-one mean bias as the primary driver of numerical instability in FP4-quantized LLM training: blockwise quantization scales react to the extreme activation magnitudes this bias produces.
- The mean bias emerges systematically across layers and training stages and accounts for most of the extreme activation magnitudes, inflating the dynamic range each block must cover and compressing the long-tail semantic variation around the mean.
- It can be removed with a simple source-level mean subtraction, avoiding heavier spectral methods and remaining compatible with standard quantization kernels (see the sketch after this list).
- Empirical FP4 results show that mean removal narrows the loss gap to BF16 and restores downstream performance, providing a hardware-efficient path to stable low-bit LLM training.
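The mechanism and the fix are easy to see in a toy numpy sketch (not the paper's code): blockwise absmax quantization to a symmetric 4-bit grid spends nearly its whole dynamic range on a shared mean and flattens the small per-token variation around it, while subtracting the mean first lets the grid resolve that variation. The block size, the uniform 15-level grid standing in for FP4's non-uniform E2M1 format, and the magnitudes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_block_4bit(x):
    """Blockwise absmax quantization to a symmetric 4-bit grid.

    Toy stand-in for FP4: 15 uniform codes in [-7, +7], scaled so the
    block's absmax lands on the largest code (real FP4/E2M1 is non-uniform).
    """
    scale = np.abs(x).max() / 7.0
    if scale == 0.0:
        return x.copy()
    return np.clip(np.round(x / scale), -7, 7) * scale

# Activations = large coherent mean (the rank-one bias) + small
# token-wise variation (the long-tail semantic signal).
mean = 8.0                                # assumed bias magnitude
signal = 0.05 * rng.standard_normal(64)   # assumed signal scale
block = mean + signal

naive = quantize_block_4bit(block)                   # scale inflated by the mean
debiased = quantize_block_4bit(block - mean) + mean  # source-level mean subtraction

print("error, naive quantization:    ", np.abs(naive - block).mean())
print("error, after mean subtraction:", np.abs(debiased - block).mean())
# The naive path collapses every value in the block onto (almost) a single
# code; subtracting the mean lets the 4-bit grid resolve the residual signal.
```

In an actual training pipeline the subtracted mean would presumably be carried alongside the quantized block and re-added in higher precision on the dequantize path; the sketch only illustrates why the coherent rank-one component dominates the blockwise scale and why removing it restores resolution for the remaining signal.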