Spectral Compact Training: Pre-Training Large Language Models via Permanent Truncated SVD and Stiefel QR Retraction

arXiv cs.LG / 4/2/2026


Key Points

  • Spectral Compact Training (SCT) proposes representing MLP weight matrices with permanent truncated SVD factors so the dense matrices are never materialized during training or inference, targeting the memory wall on limited hardware.
  • SCT keeps gradients compatible with standard backprop by optimizing compact spectral parameters while retracting the orthogonal factors (U, V) onto the Stiefel manifold using QR after each optimizer step.
  • The method reports dramatic memory savings—up to ~199× per MLP layer at rank 32—and demonstrates training steps for 70B-class architectures on a Steam Deck (7.2 GB peak vs. 1,245 GB for dense FP32 Adam training).
  • Experiments on SmolLM2-1.7B show that different SVD ranks converge to the same loss floor, suggesting the learning rate schedule is the main bottleneck rather than the MLP rank, with rank 128 as a reported efficiency/perplexity sweet spot.
  • SCT also reports practical training gains, including a 46% reduction in GPU memory at rank 32 and a doubling of training throughput.
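The factorized layer and the memory arithmetic behind the points above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the MLP dimensions (8192 hidden, 28672 FFN, typical of 70B-class models) are an assumption, since the article does not state them.

```python
import numpy as np

# Hypothetical MLP dimensions for a 70B-class model (an assumption, not from
# the article): hidden size 8192, FFN size 28672, truncated-SVD rank 32.
d_in, d_out, r = 8192, 28672, 32

rng = np.random.default_rng(0)
# Compact spectral factors; the dense W = U @ diag(s) @ V.T is never built.
U = np.linalg.qr(rng.standard_normal((d_out, r)))[0]  # orthonormal columns
V = np.linalg.qr(rng.standard_normal((d_in, r)))[0]   # orthonormal columns
s = rng.random(r)                                     # singular values

def factored_forward(x):
    """Apply W to x at rank r: (batch, d_in) -> (batch, d_out)."""
    return ((x @ V) * s) @ U.T

x = rng.standard_normal((4, d_in))
y = factored_forward(x)

# Parameter-count arithmetic consistent with the ~199x per-layer figure:
dense_params = d_in * d_out              # 234,881,024 entries in dense W
compact_params = r * (d_in + d_out + 1)  # 1,179,680 entries in U, V, s
ratio = dense_params / compact_params    # roughly 199x at rank 32
```

Under these assumed dimensions the rank-32 factors store about 199× fewer parameters than the dense matrix, matching the reduction the paper reports.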

Abstract

The memory wall remains the primary bottleneck for training large language models on consumer hardware. We introduce Spectral Compact Training (SCT), a method that replaces dense weight matrices with permanent truncated SVD factors W = U diag(s) V^T, where the full dense matrix is never materialized during training or inference. Gradients flow through the compact spectral factors via standard backpropagation, and U, V are retracted to the Stiefel manifold via QR decomposition after each optimizer step. SCT achieves up to 199x memory reduction per MLP layer at rank 32, enabling full training steps of 70B-parameter architectures on a Steam Deck handheld (7.2 GB peak memory vs. 1,245 GB for dense FP32 training with Adam). Rank-sweep experiments on SmolLM2-1.7B (ranks 32-256, 2000 steps, NVIDIA A100) show that all tested ranks converge to the same loss floor (~4.2-4.5), identifying the learning rate schedule -- not MLP rank -- as the primary bottleneck. Rank 128 emerges as the efficiency sweet spot at 11.7x MLP compression with the lowest perplexity. GPU memory drops 46% at rank 32 while training throughput doubles.
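The QR retraction step the abstract describes can be sketched as follows: an ordinary gradient update pushes U off the Stiefel manifold (its columns lose orthonormality), and a QR decomposition after the optimizer step restores it. The sign convention on diag(R) is a common choice for making the retraction unique, assumed here rather than taken from the paper.

```python
import numpy as np

def qr_retract(M):
    """Retract a tall matrix onto the Stiefel manifold (orthonormal columns)
    via QR, flipping column signs so that diag(R) >= 0 (uniqueness)."""
    Q, R = np.linalg.qr(M)
    sign = np.sign(np.diag(R))
    sign[sign == 0] = 1.0  # guard against exact zeros on the diagonal
    return Q * sign        # scales each column of Q by its sign

rng = np.random.default_rng(1)
U = np.linalg.qr(rng.standard_normal((512, 32)))[0]  # start on the manifold

# A plain (hypothetical) gradient step knocks U off the manifold...
grad = rng.standard_normal(U.shape)
U_step = U - 1e-2 * grad
off = np.linalg.norm(U_step.T @ U_step - np.eye(32))

# ...and QR retraction after the step restores orthonormality.
U_new = qr_retract(U_step)
on = np.linalg.norm(U_new.T @ U_new - np.eye(32))
```

Applying this after every optimizer step keeps U and V orthonormal while leaving the gradient computation itself to standard backpropagation, as the abstract describes.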