Spectral Compact Training: Pre-Training Large Language Models via Permanent Truncated SVD and Stiefel QR Retraction
arXiv cs.LG / 4/2/2026
Key Points
- Spectral Compact Training (SCT) proposes representing MLP weight matrices with permanent truncated SVD factors so the dense matrices are never materialized during training or inference, targeting the memory wall on limited hardware.
- SCT keeps gradients compatible with standard backprop by optimizing compact spectral parameters while retracting the orthogonal factors (U, V) onto the Stiefel manifold using QR after each optimizer step.
- The method reports dramatic memory savings—up to ~199× per MLP layer at rank 32—and demonstrates training steps for 70B-class architectures on a Steam Deck (7.2 GB peak vs. 1,245 GB for dense FP32 Adam training).
- Experiments on SmolLM2-1.7B show that different SVD ranks converge to the same loss floor, suggesting the learning rate schedule is the main bottleneck rather than the MLP rank, with rank 128 as a reported efficiency/perplexity sweet spot.
- SCT also reports practical training gains, including a 46% reduction in GPU memory at rank 32 and a doubling of training throughput.
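The core mechanics described above can be sketched compactly: keep an MLP weight as compact factors U·diag(s)·Vᵀ, apply it factor-by-factor so the dense matrix is never materialized, and after each optimizer step project the perturbed orthogonal factors back onto the Stiefel manifold with a QR decomposition. The sketch below is a minimal numpy illustration under assumed dimensions and a hypothetical single "step"; it is not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 48, 8  # hypothetical layer sizes and SVD rank

# Compact spectral parameters; the dense d_out x d_in matrix is never formed.
U = np.linalg.qr(rng.standard_normal((d_out, r)))[0]  # orthonormal columns
V = np.linalg.qr(rng.standard_normal((d_in, r)))[0]
s = rng.random(r) + 0.1                               # singular values

def forward(x):
    """y = U diag(s) V^T x, computed factor-by-factor in O((d_in + d_out) r)."""
    return U @ (s * (V.T @ x))

def qr_retract(M):
    """Retract a perturbed factor back onto the Stiefel manifold via QR.
    Fixing the signs of R's diagonal makes the retraction unique."""
    Q, R = np.linalg.qr(M)
    return Q * np.sign(np.diag(R))

# One illustrative optimizer step: perturb U off the manifold, then retract.
U = qr_retract(U + 0.01 * rng.standard_normal(U.shape))
print(np.allclose(U.T @ U, np.eye(r)))  # True: columns are orthonormal again
```

The same retraction would be applied to V after its update; the singular values s remain unconstrained parameters, which is what keeps the whole scheme compatible with a standard optimizer step followed by a cheap projection.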