MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU

arXiv cs.CL / 4/8/2026


Key Points

  • MegaTrain is a memory-centric training system designed to train 100B+ parameter LLMs at full precision on a single GPU by keeping parameters and optimizer states in host (CPU) memory.
  • Instead of relying on GPU-resident persistent state, MegaTrain streams parameters layer-by-layer into the GPU, computes gradients, and offloads them back, minimizing what must remain on-device.
  • The system addresses CPU–GPU bandwidth limits with a double-buffered pipelined execution engine that overlaps parameter prefetching, computation, and gradient offloading using multiple CUDA streams.
  • It avoids persistent autograd graph overhead by using stateless layer templates that bind weights dynamically as they stream in, improving flexibility and reducing graph metadata.
  • Reported results show reliable 120B training on a single H200 with 1.5TB host memory, up to 1.84× throughput versus DeepSpeed ZeRO-3 (CPU offloading) on 14B, and 7B training with a 512k token context on a GH200.
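The double-buffered schedule in the points above can be sketched in plain Python. This is a hypothetical stand-in, not MegaTrain's implementation: threads emulate the asynchronous CUDA streams that would perform host-to-device prefetch and device-to-host offload, and `fetch`/`compute`/`offload` are placeholder names for illustration.

```python
# Hypothetical sketch of layer-wise double-buffered pipelining.
# In the real system, fetch/offload would be async CUDA-stream copies
# overlapping with GPU compute; here a thread stands in for the prefetch stream.
import threading

def fetch(layer_id):
    # stand-in for host->device parameter prefetch
    return f"weights[{layer_id}]"

def compute(weights):
    # stand-in for forward/backward on the streamed-in layer
    return f"grads<{weights}>"

def offload(grads, log):
    # stand-in for device->host gradient offload
    log.append(grads)

def train_step(num_layers):
    log = []
    buf = fetch(0)                       # prefill buffer A with layer 0
    for i in range(num_layers):
        nxt = {}
        t = None
        if i + 1 < num_layers:           # prefetch layer i+1 into buffer B
            t = threading.Thread(target=lambda: nxt.update(w=fetch(i + 1)))
            t.start()
        grads = compute(buf)             # compute on the current buffer
        offload(grads, log)              # offload gradients (concurrent in practice)
        if t is not None:
            t.join()
            buf = nxt["w"]               # swap buffers for the next layer
    return log
```

The key property is that only two layers' worth of parameters ever occupy the device at once, while prefetch, compute, and offload stages overlap.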

Abstract

We present MegaTrain, a memory-centric system that efficiently trains 100B+ parameter large language models at full precision on a single GPU. Unlike traditional GPU-centric systems, MegaTrain stores parameters and optimizer states in host (CPU) memory and treats GPUs as transient compute engines. For each layer, we stream parameters in and compute gradients out, minimizing persistent device state. To mitigate the CPU–GPU bandwidth bottleneck, we adopt two key optimizations. 1) We introduce a pipelined double-buffered execution engine that overlaps parameter prefetching, computation, and gradient offloading across multiple CUDA streams, enabling continuous GPU execution. 2) We replace persistent autograd graphs with stateless layer templates that bind weights dynamically as they stream in, eliminating persistent graph metadata while providing flexibility in scheduling. On a single H200 GPU with 1.5TB host memory, MegaTrain reliably trains models up to 120B parameters. It also achieves 1.84× the training throughput of DeepSpeed ZeRO-3 with CPU offloading when training 14B models. MegaTrain also enables 7B model training with a 512k token context on a single GH200.
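The stateless layer template idea can also be illustrated with a minimal sketch. This is an assumption-laden toy, not the paper's code: `LinearTemplate` and `host_params` are hypothetical names, and plain Python lists stand in for tensors. The point is that the template owns only the computation; each layer's weights arrive as call-time arguments streamed from host memory, so no per-layer parameters or autograd graph persist on the device.

```python
# Hypothetical sketch of a stateless layer template: the template holds the
# layer computation but no parameters, so one template serves every layer.
class LinearTemplate:
    """A layer 'shape' with no weights of its own."""
    def __call__(self, x, weight, bias):
        # weight/bias are bound dynamically as they stream in from host memory
        return [sum(wi * xi for wi, xi in zip(row, x)) + b
                for row, b in zip(weight, bias)]

template = LinearTemplate()

# Stand-in for host-resident per-layer parameters (streamed just-in-time).
host_params = [
    {"weight": [[1.0, 0.0], [0.0, 1.0]], "bias": [0.0, 0.0]},  # identity layer
    {"weight": [[2.0, 0.0], [0.0, 2.0]], "bias": [1.0, 1.0]},  # scale-and-shift
]

x = [1.0, 2.0]
for layer in host_params:
    x = template(x, **layer)  # same template, different streamed weights
```

Because the template is reused across layers, the scheduler is free to decide when each layer's weights are fetched and released, which is the flexibility the abstract attributes to avoiding persistent graph metadata.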