[R] Spectral Compact Training: 172x memory reduction for 70B model training - verified on a Steam Deck (7.24 GB)

Reddit r/MachineLearning / 3/28/2026


Key Points

  • Spectral Compact Training (SCT) represents each weight matrix as W = U diag(s) Vᵀ and trains using the smaller spectral factors, never materializing the full dense weight matrix.
  • The method claims exact gradients with standard backprop and uses QR retraction to keep U and V orthonormal after each optimizer update step.
  • For a LLaMA-3–style 70B-class configuration, the article reports dramatic memory savings for the training step (about 1,245 GB dense+Adam vs 7.24 GB with SCT+Adam, ~172× reduction) while keeping compute time roughly unchanged.
  • The author presents an architectural validation run on a Steam Deck (CPU, 16 GB RAM) achieving a full training step in ~6.28 seconds, emphasizing this is proof-of-concept rather than a fully trained production model.
  • Additional experiments (e.g., XOR and sine regression) are said to show SCT can match dense-training quality, with the approach described as weak below ~1.7B but strong for 7B+ scales.
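The factorized forward pass described in the first bullet can be sketched in a few lines of NumPy. The shapes and the rank `r` below are illustrative assumptions, not values from the post; the point is that the product `x Wᵀ` is computed through the small factors without ever forming `W`:

```python
import numpy as np

# Hypothetical shapes: one m x n layer with rank-r spectral factors.
rng = np.random.default_rng(0)
m, n, r, batch = 64, 48, 8, 4

# Factors with orthonormal columns, obtained via QR.
U, _ = np.linalg.qr(rng.standard_normal((m, r)))
V, _ = np.linalg.qr(rng.standard_normal((n, r)))
s = rng.standard_normal(r)

x = rng.standard_normal((batch, n))

# Factorized forward pass: y = x V diag(s) U^T, never materializing W.
y_factored = ((x @ V) * s) @ U.T

# Dense reference, built here only to verify the equivalence.
W = U @ np.diag(s) @ V.T
y_dense = x @ W.T
print(np.allclose(y_factored, y_dense))  # True
```

The factorized path touches only `m*r + n*r + r` parameters per matrix instead of `m*n`, which is where the claimed memory savings come from.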

This is a research article about a patent I filed (not self-promotion).

I am dyslexic so I used AI to help with the writing.

I have been working on Spectral Compact Training (SCT). It stores every weight matrix as W = U diag(s) Vᵀ and trains directly through the small spectral factors.

Never builds the dense matrix. Exact gradients via standard backprop. QR retraction keeps U and V orthonormal after each optimizer step.
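The QR retraction step can be sketched as follows. The sign-fixing convention (making the diagonal of R non-negative so the factorization is unique) is a common choice and an assumption here, not necessarily the exact retraction SCT uses:

```python
import numpy as np

def qr_retract(M):
    """Project M back onto the set of matrices with orthonormal columns.

    QR gives M = Q R with Q orthonormal; flipping column signs so that
    diag(R) >= 0 makes Q unique and keeps it close to M.
    """
    Q, R = np.linalg.qr(M)
    signs = np.sign(np.diag(R))
    signs[signs == 0] = 1.0
    return Q * signs

rng = np.random.default_rng(1)
U, _ = np.linalg.qr(rng.standard_normal((32, 4)))
U_drifted = U + 0.01 * rng.standard_normal(U.shape)  # after an optimizer step
U_retracted = qr_retract(U_drifted)

# Max deviation of A^T A from the identity, before and after retraction.
def ortho_err(A):
    return np.abs(A.T @ A - np.eye(A.shape[1])).max()

print(ortho_err(U_drifted), ortho_err(U_retracted))
```

After a gradient step U and V drift off the orthonormal manifold; the retraction snaps them back, which is consistent with the small orthonormality error reported below.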

Results on a 70B-class architecture (80 layers, hidden=8192, FFN=28672, LLaMA-3 style):

  • Dense + Adam: 1,245 GB
  • SCT + Adam: 7.24 GB
  • Compression: 172x
  • Full training step on Steam Deck: 6.28 seconds
  • Orthonormality error: 1.30 × 10⁻⁶
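The per-matrix arithmetic behind the compression is easy to reproduce. The rank used by SCT is not stated in the post, so `r` below is a free parameter; the overall ratio also depends on optimizer state, so this is only the parameter-count side of the claim:

```python
# Per-matrix parameter counts: dense W (m x n) vs. spectral factors
# U (m x r), V (n x r), s (r). Rank r is a hypothetical choice.
def dense_floats(m, n):
    return m * n

def spectral_floats(m, n, r):
    return m * r + n * r + r

m, n, r = 8192, 8192, 64  # hidden=8192 square matrix, hypothetical rank 64
ratio = dense_floats(m, n) / spectral_floats(m, n, r)
print(ratio)  # roughly 64x at this rank
```

Since Adam keeps two moment buffers per parameter, shrinking the parameter count shrinks optimizer state by the same factor, which is why the dense-plus-Adam figure drops so sharply.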

Video proof (full run on Steam Deck CPU, 16 GB RAM): git -> proof

To be clear: this is architectural validation, not a finished trained model. SCT solves the memory wall. Compute time stays the same.

The MLP proof shows SCT matching dense training quality (100 percent accuracy on XOR, near-identical loss on sine regression). Compression scales with model size: it is weak below 1.7B parameters but powerful at 7B+.
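A toy version of the XOR experiment can be sketched end to end: a small MLP whose hidden-layer weight is stored only as U diag(s) Vᵀ, trained with manual backprop through the factors, SGD, and QR retraction after each step. Everything here (architecture, learning rate, retraction convention) is an illustrative assumption, not the repo's actual code:

```python
import numpy as np

rng = np.random.default_rng(42)

# XOR data.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

# Hidden weight W1 (8 x 2) stored as U1 diag(s1) V1^T; output layer kept dense.
m, n, r = 8, 2, 2
U1, _ = np.linalg.qr(rng.standard_normal((m, r)))
V1, _ = np.linalg.qr(rng.standard_normal((n, r)))
s1 = rng.standard_normal(r)
b1 = np.zeros(m)
W2 = rng.standard_normal((1, m)) * 0.5
b2 = np.zeros(1)

def qr_retract(M):
    Q, R = np.linalg.qr(M)
    sgn = np.sign(np.diag(R))
    sgn[sgn == 0] = 1.0
    return Q * sgn

lr, losses = 0.2, []
for step in range(3000):
    # Forward pass through the factors only; W1 is never built.
    B = X @ V1             # (4, r)
    A = B * s1             # (4, r)
    h = np.tanh(A @ U1.T + b1)
    p = 1.0 / (1.0 + np.exp(-(h @ W2.T + b2)))
    loss = -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))
    losses.append(loss)

    # Exact gradients w.r.t. the factors, by the chain rule.
    dz_out = (p - y) / len(X)
    dW2, db2 = dz_out.T @ h, dz_out.sum(0)
    dz = (dz_out @ W2) * (1 - h ** 2)
    dU1 = dz.T @ A
    dA = dz @ U1
    ds1 = (dA * B).sum(0)
    dV1 = X.T @ (dA * s1)
    db1 = dz.sum(0)

    # SGD step, then retraction to restore orthonormal columns.
    U1 = qr_retract(U1 - lr * dU1)
    V1 = qr_retract(V1 - lr * dV1)
    s1 -= lr * ds1
    b1 -= lr * db1
    W2 -= lr * dW2
    b2 -= lr * db2

print(losses[0], losses[-1])
```

At this tiny scale the factors are not smaller than the dense matrix (the post itself notes SCT is weak below ~1.7B), so the example only illustrates the training mechanics, not the compression.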

Code (Apache 2.0): https://github.com/EctoSpace/SCT

Patent pending. Happy to answer questions about the math or limitations. Looking for arXiv cs.LG endorsers. DM me.

submitted by /u/purdycuz