Optimal low-rank stochastic gradient estimation for LLM training
arXiv cs.LG / 2026/3/24
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key points
- The paper proposes an unbiased, memory-efficient low-rank stochastic gradient estimator for LLM training that targets both memory bottlenecks and stochastic gradient noise in high-dimensional parameter spaces.
- It reduces memory by projecting a high-dimensional gradient estimator onto a randomly chosen low-dimensional subspace and then lifting it back, while controlling mean-squared error through an optimally designed projection distribution.
- The optimal random projector is derived via a constrained functional optimization problem, including Haar–Stiefel projections, to guide how the projection should be sampled during training.
- Experiments on RoBERTa-large fine-tuning show substantially lower peak GPU memory (e.g., 3.83GB vs 16.7GB for full backprop) with competitive accuracy.
- In autoregressive LLM pretraining (LLaMA-20M/60M/100M), the method reportedly outperforms prior low-rank and gradient-estimation approaches, indicating improved training behavior from the proposed projection strategy.
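The project-then-lift idea above can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: it samples a Haar-distributed matrix `P` with orthonormal columns (a point on the Stiefel manifold, via QR of a Gaussian matrix), compresses a gradient vector `g` to `r` dimensions, lifts it back, and rescales by `d/r` so the estimator is unbiased — this uses the standard fact that `E[P Pᵀ] = (r/d) I` for such `P`. The function names and the toy dimensions are assumptions for illustration.

```python
import numpy as np

def sample_haar_stiefel(d, r, rng):
    # QR of a d x r Gaussian matrix gives orthonormal columns;
    # the sign correction on R's diagonal makes the law exactly Haar.
    A = rng.standard_normal((d, r))
    Q, R = np.linalg.qr(A)
    Q = Q * np.sign(np.diag(R))
    return Q

def low_rank_grad_estimate(g, r, rng):
    # Project g to an r-dim random subspace, lift back, and rescale.
    # Unbiased because E[P @ P.T] = (r/d) * I for Haar-distributed P.
    d = g.shape[0]
    P = sample_haar_stiefel(d, r, rng)
    return (d / r) * (P @ (P.T @ g))

# Empirical check of unbiasedness: the average of many independent
# low-rank estimates should approach the full gradient g.
rng = np.random.default_rng(0)
d, r = 64, 8
g = rng.standard_normal(d)
est = np.mean([low_rank_grad_estimate(g, r, rng) for _ in range(20000)], axis=0)
print(np.max(np.abs(est - g)))  # small residual, shrinking with more samples
```

Note the memory angle: only the r-dimensional coordinates `P.T @ g` (plus a way to regenerate `P`, e.g. from a random seed) need to be stored between projection and lift, which is where the savings over keeping full d-dimensional optimizer state come from. The paper's contribution is choosing the *distribution* of `P` to minimize mean-squared error subject to this unbiasedness constraint; the Haar–Stiefel sampler here is just one member of that family.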

