MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU

Reddit r/artificial / 4/8/2026

💬 Opinion · Developer Stack & Infrastructure · Models & Research

Key Points

  • MegaTrain is a memory-centric training system that targets full-precision (not quantized) training of 100B+ parameter LLMs using a single GPU by storing model parameters and optimizer states in CPU host memory.
  • It treats GPUs as transient compute engines: parameters are streamed in layer-by-layer, gradients are computed, and the results are offloaded back to host memory, minimizing persistent GPU memory usage.
  • To address CPU–GPU bandwidth limits, the system uses a pipelined, double-buffered execution approach with multiple CUDA streams to overlap parameter prefetching, computation, and gradient offloading.
  • It avoids persistent autograd graphs by using stateless layer templates with dynamically bound weights, reducing graph metadata overhead and supporting flexible scheduling.
  • Benchmarks report training up to 120B parameters on a single NVIDIA H200 (1.5TB host memory), 1.84× throughput vs DeepSpeed ZeRO-3 with CPU offloading for 14B models, and 7B models with 512k context on a single GH200.
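The double-buffered overlap described above can be sketched in plain Python. This is only an illustration of the scheduling idea, not MegaTrain's implementation: the `fetch_weights` and `compute` helpers are hypothetical stand-ins for host-to-GPU parameter copies on a prefetch CUDA stream and kernel launches on a compute stream, and a background thread plays the role of the prefetch stream so that fetching layer i+1 overlaps computing layer i.

```python
import threading
from queue import Queue

def fetch_weights(layer_id):
    # Stand-in for streaming one layer's parameters from host memory.
    return {"layer": layer_id, "w": layer_id * 2}

def compute(weights, x):
    # Stand-in for a forward/backward step using the streamed-in weights.
    return x + weights["w"]

def pipelined_forward(num_layers, x):
    """Double-buffered execution: prefetch layer i+1 while computing layer i."""
    buf = Queue(maxsize=1)  # at most one in-flight prefetch (the second buffer)

    def prefetcher():
        for i in range(num_layers):
            buf.put(fetch_weights(i))  # blocks until the buffer slot frees up

    t = threading.Thread(target=prefetcher)
    t.start()
    for _ in range(num_layers):
        weights = buf.get()      # wait for the prefetched buffer
        x = compute(weights, x)  # overlaps with the next layer's prefetch
    t.join()
    return x

print(pipelined_forward(4, 0))  # layers contribute 0 + 2 + 4 + 6 -> 12
```

The `maxsize=1` queue is what makes this double-buffered: while the consumer computes with one set of weights, the producer may stage exactly one more, so transfer and compute stay overlapped without unbounded memory growth.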

https://arxiv.org/abs/2604.05091

Abstract: "We present MegaTrain, a memory-centric system that efficiently trains 100B+ parameter large language models at full precision on a single GPU. Unlike traditional GPU-centric systems, MegaTrain stores parameters and optimizer states in host memory (CPU memory) and treats GPUs as transient compute engines. For each layer, we stream parameters in and compute gradients out, minimizing persistent device state. To battle the CPU-GPU bandwidth bottleneck, we adopt two key optimizations. 1) We introduce a pipelined double-buffered execution engine that overlaps parameter prefetching, computation, and gradient offloading across multiple CUDA streams, enabling continuous GPU execution. 2) We replace persistent autograd graphs with stateless layer templates, binding weights dynamically as they stream in, eliminating persistent graph metadata while providing flexibility in scheduling. On a single H200 GPU with 1.5TB host memory, MegaTrain reliably trains models up to 120B parameters. It also achieves 1.84x the training throughput of DeepSpeed ZeRO-3 with CPU offloading when training 14B models. MegaTrain also enables 7B model training with 512k token context on a single GH200."
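The "stateless layer template" idea from the abstract can also be sketched in a few lines of plain Python (names here are hypothetical, not from the paper): the layer is a pure function describing the computation, and each freshly streamed weight set is bound to it at call time and then discarded, so no per-layer parameter object or persistent graph state survives between layers.

```python
def linear_template(weights, x):
    """Stateless template: describes the computation but owns no parameters."""
    return [weights["scale"] * v + weights["bias"] for v in x]

def run_model(layer_weight_stream, x):
    # Bind each streamed-in weight set to the same template, use it once,
    # and let it go, mirroring the stream-in / compute / evict cycle.
    for weights in layer_weight_stream:
        x = linear_template(weights, x)
    return x

# Weights arrive lazily from a generator, standing in for host-memory streaming.
stream = ({"scale": s, "bias": 1} for s in (2, 3))
print(run_model(stream, [1, 0]))  # -> [10, 4]
```

In a PyTorch setting this pattern corresponds roughly to calling a module functionally with externally supplied parameters (e.g. `torch.func.functional_call`) instead of keeping weight-holding `nn.Module` instances resident on the device.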

submitted by /u/nickpsecurity