Optimal low-rank stochastic gradient estimation for LLM training

arXiv cs.LG / 2026-03-24


Key points

  • The paper proposes an unbiased, memory-efficient low-rank stochastic gradient estimator for LLM training that targets both memory bottlenecks and stochastic gradient noise in high-dimensional parameter spaces.
  • It reduces memory by projecting a high-dimensional gradient estimator onto a randomly chosen low-dimensional subspace and then lifting it back, while controlling mean-squared error through an optimally designed projection distribution.
  • The optimal random projector is derived via a constrained functional optimization problem, including Haar–Stiefel projections, to guide how the projection should be sampled during training.
  • Experiments on RoBERTa-large fine-tuning show substantially lower peak GPU memory (e.g., 3.83GB vs 16.7GB for full backprop) with competitive accuracy.
  • In autoregressive LLM pretraining (LLaMA-20M/60M/100M), the method reportedly outperforms prior low-rank gradient estimation approaches, suggesting that the optimally designed projection distribution improves training behavior.
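The project-and-lift idea in the bullets above can be sketched numerically. As one natural construction (an illustration under assumptions, not necessarily the paper's exact estimator): sample a Haar-distributed orthonormal frame P ∈ R^{m×r} on the Stiefel manifold, project the gradient G to PᵀG, and lift back with the rescaling (m/r)·P(PᵀG). Since E[PPᵀ] = (r/m)·I for Haar P, the rescaled estimate is unbiased:

```python
import numpy as np

rng = np.random.default_rng(0)
m, r = 64, 8                       # full dimension, projection rank
G = rng.standard_normal((m, 16))   # stand-in for a gradient matrix

def sample_stiefel(m, r, rng):
    """Haar-distributed m x r orthonormal frame via QR of a Gaussian matrix."""
    A = rng.standard_normal((m, r))
    Q, R = np.linalg.qr(A)
    return Q * np.sign(np.diag(R))  # sign correction makes the law exactly Haar

def project_and_lift(G, r, rng):
    """Unbiased low-rank estimate: (m/r) * P P^T G, since E[P P^T] = (r/m) I."""
    P = sample_stiefel(G.shape[0], r, rng)
    return (G.shape[0] / r) * P @ (P.T @ G)

# Averaging over many independent projectors recovers the true gradient,
# illustrating unbiasedness; the residual shrinks as the sample count grows.
est = np.mean([project_and_lift(G, r, rng) for _ in range(20000)], axis=0)
print(np.max(np.abs(est - G)))
```

Any single sample is a rank-r (hence memory-cheap) matrix; the variance of that single sample is what the paper's optimal projection distribution is designed to minimize.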

Abstract

Large language model (LLM) training is often bottlenecked by memory constraints and stochastic gradient noise in extremely high-dimensional parameter spaces. Motivated by empirical evidence that many LLM gradient matrices are effectively low-rank during training, we present an unbiased, memory-efficient, low-rank matrix estimator with the lowest variance that is applicable across common stochastic gradient estimation paradigms. The core idea is to project a high-dimensional stochastic gradient estimator onto a random low-dimensional subspace and lift it back, reducing memory while keeping the estimator unbiased and controlling mean-squared error via an optimally designed projection distribution, including Haar--Stiefel projections. The projection distribution is derived by solving a constrained functional optimization problem, yielding an optimal random projector that guides algorithm design. Empirically, the resulting low-rank gradient estimators deliver both practical memory savings and improved training behavior. In RoBERTa-large fine-tuning, our method attains the lowest peak GPU memory among compared methods (e.g., 3.83GB versus 16.7GB for full BP) while remaining competitive in accuracy; in autoregressive LLM pretraining (LLaMA-20M/60M/100M), our method outperforms the traditional methods, supporting the benefit of the proposed optimal projection strategy.
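The memory saving in the abstract comes from storing an r-dimensional factor instead of the full gradient matrix. A back-of-the-envelope sketch with hypothetical shapes (not the paper's reported configuration) shows the scale of the reduction for a single fp32 gradient:

```python
# Back-of-the-envelope memory for one fp32 gradient matrix.
# Shapes are hypothetical, chosen only to illustrate the scaling.
bytes_per_float = 4
m, n, r = 4096, 4096, 128          # layer dimensions and a low projection rank

full = m * n * bytes_per_float                 # dense gradient G (m x n)
lowrank = (m * r + r * n) * bytes_per_float    # projector P (m x r) + factor P^T G (r x n)

print(f"full: {full / 2**20:.1f} MiB, low-rank: {lowrank / 2**20:.1f} MiB")
# full: 64.0 MiB, low-rank: 4.0 MiB
```

If the random projector is regenerated from a stored RNG seed rather than kept in memory, only the r×n factor needs to persist, shrinking the footprint further.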
