~/satyajit

MegaTrain: training a 120B model on one GPU by inverting where memory lives


2026-06-28 · 8 min · llm · training · systems · cuda · explainer

The thing that stops most people from training a big model isn't FLOPs — it's GPU memory. A 70B model's parameters, gradients, and Adam moments don't fit in an 80GB card, so you reach for tensor/pipeline parallelism across a cluster you may not have, or for offloading systems that thrash and OOM as the model grows.

MegaTrain takes the other path: invert the memory hierarchy. Host CPU memory becomes the authoritative store for all persistent state — parameters, gradients, and optimizer moments — and the GPU is demoted to a transient compute engine that holds only the layer it's working on right now. On a single H200 with 1.5 TB of host RAM, that's enough to train models up to 120B parameters at full precision — no quantization, no second GPU. It's a systems paper, and a good one, so let's read it as systems.

The inversion

Start with the accounting. Mixed-precision Adam costs about 12 bytes per parameter: 2 for the BF16 weight, 2 for the BF16 gradient, and 8 for the FP32 first and second moments. A GPU-centric trainer keeps all of that in HBM, so the moment $12 \times \text{params}$ bytes exceeds the card, you're done. Move that state to host memory and stream one layer at a time, and the device footprint goes flat while the host's terabytes set the ceiling. Here is where each layout runs out of memory at 70B:

Where the state lives · single H200 (141 GB HBM, 1.5 TB host) · model size 70B params, 840 GB state
GPU-centric (all state in HBM, e.g. FSDP): OOM, needs 840 GB, has 141 GB
MegaTrain (state in host RAM, one layer streamed): fits, HBM ~6 GB, host 840 / 1536 GB
Max model size: GPU-centric ~11B · MegaTrain ~128B

A single H200's HBM caps a GPU-centric run at roughly 11B parameters. Move the persistent state to the 1.5 TB host and keep only a streamed layer on the device, and the same card trains past 120B at full precision, with no quantization. The catch is what you trade for it: every layer's weights now cross PCIe twice per step, which is exactly the bottleneck the double-buffered pipeline exists to hide.
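The ceilings quoted above fall straight out of the 12-bytes-per-parameter accounting. A minimal sketch, assuming decimal gigabytes and counting only persistent state (no activations, no streamed-layer buffer); the function names are mine, not the paper's:

```python
# Persistent training state per parameter under mixed-precision Adam,
# using the byte counts given in the text.
BYTES_PER_PARAM = 2 + 2 + 8  # BF16 weight + BF16 gradient + FP32 m and v


def state_gb(params_billion):
    """Persistent state in GB for a model with this many billion parameters."""
    # 1e9 params * bytes / 1e9 bytes-per-GB cancels to a plain multiply.
    return params_billion * BYTES_PER_PARAM


def max_params_billion(capacity_gb):
    """Largest model (in billions of parameters) whose state fits in capacity_gb."""
    return capacity_gb / BYTES_PER_PARAM


print(state_gb(70))              # 840
print(max_params_billion(141))   # 11.75  (state held in H200 HBM)
print(max_params_billion(1536))  # 128.0  (state held in 1.5 TB of host RAM)
```

The real limits sit a little below these numbers, since activations and working buffers need room too; the point is only that the ceiling moves from the HBM figure to the host-RAM figure.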

That's the whole idea: a single H200's 141 GB of HBM caps a GPU-centric run around 11–12B parameters, but with the persistent state in 1.5 TB of host RAM and only a streamed layer resident, the same card reaches past 120B. The paper's architecture figure makes the split concrete, with everything persistent in the CPU domain and the GPU domain as scratch:

MegaTrain's architecture (paper, Figure 2): the CPU domain is the authoritative store, holding layer parameters, optimizer moments (m, v), and CPU Adam in pinned memory. The GPU domain is transient: stateless layer templates, a weight buffer, gradient slabs, and an activation-checkpoint workspace, fed over PCIe / NVLink-C2C through two alternating buffers.

The bottleneck, and the step

Inverting the layout creates one obvious problem: PCIe bandwidth. Every layer's weights now have to cross the bus into the GPU, and its gradients have to cross back — ~128 GB/s on H200's PCIe Gen5, versus 4.8 TB/s of on-card HBM. If you did this naively the GPU would spend most of its life waiting on DMA.
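To get a feel for the size of that problem, here is a back-of-envelope estimate of per-step bus traffic. It is a rough sketch under loud assumptions: weights cross the bus twice per step (forward and backward) and one BF16 gradient per parameter crosses back, as described in this post; the ~128 GB/s figure is applied to the combined traffic, which may be optimistic if it is an aggregate of both directions. The helper names are hypothetical.

```python
BF16_BYTES = 2  # bytes per weight and per gradient, per the accounting above


def per_step_traffic_gb(params_billion):
    """Return (host-to-device GB, device-to-host GB) moved in one training step."""
    weights_gb = params_billion * BF16_BYTES
    h2d_gb = 2 * weights_gb  # weights stream in for forward, then again for backward
    d2h_gb = weights_gb      # one BF16 gradient per parameter goes back to the host
    return h2d_gb, d2h_gb


def serial_transfer_seconds(params_billion, bus_gb_per_s=128.0):
    """Time spent purely on transfers if none of it overlapped with compute."""
    h2d_gb, d2h_gb = per_step_traffic_gb(params_billion)
    return (h2d_gb + d2h_gb) / bus_gb_per_s


print(per_step_traffic_gb(70))      # (280, 140)
print(serial_transfer_seconds(70))  # 3.28125
print(4800 / 128)                   # 37.5, the HBM-to-PCIe bandwidth ratio
```

Several seconds of pure DMA per step at 70B is the stall that the rest of the design is built to hide.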

The training step is three phases built to keep that traffic off the critical path:

  1. Streaming forward: stream each layer's weights in (H2D), compute, checkpoint activations every $K$ layers, release the weights immediately.
  2. Block-wise backward: recompute activations from the nearest checkpoint, stream the layer's weights back in, compute gradients in reverse, offload them (D2H), release.
  3. Optimizer update: run Adam entirely on the CPU (AVX-512), so the freshly computed gradients and the moments never make another round trip to the device.
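Phase 3 is ordinary Adam; the only twist is that every array it touches already lives in host memory. A minimal sketch of that update in plain Python, assuming standard Adam with bias correction and textbook default hyperparameters (the paper's vectorized AVX-512 kernel and its exact settings are not reproduced here):

```python
import math


def adam_step_on_host(params, grads, m, v, step,
                      lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Update host-resident params in place from host-resident grads and moments.

    Nothing here touches the GPU: gradients arrived via D2H during backward,
    and the moments m and v never leave host memory.
    """
    for i, g in enumerate(grads):
        m[i] = beta1 * m[i] + (1.0 - beta1) * g      # first moment
        v[i] = beta2 * v[i] + (1.0 - beta2) * g * g  # second moment
        m_hat = m[i] / (1.0 - beta1 ** step)         # bias correction
        v_hat = v[i] / (1.0 - beta2 ** step)
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)


# Toy usage: two "parameters" stored as Python floats on the host.
params = [1.0, -2.0]
m = [0.0, 0.0]
v = [0.0, 0.0]
adam_step_on_host(params, grads=[0.5, -0.25], m=m, v=v, step=1)
print(params)
```

On the first step the bias-corrected update has magnitude close to `lr` regardless of gradient scale, so each parameter moves by roughly 0.001 against the sign of its gradient.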

Block-wise recomputation bounds activation memory at $O(N \cdot A_{\max} \cdot L/K)$, independent of total depth, which is what lets depth scale without the activation memory exploding.
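Checkpoint-and-recompute is easier to trust once you see it produce the same gradients as storing everything. The toy below is a sketch under stated assumptions, not the paper's code: each "layer" is a scalar `tanh(w * a)`, the loss is `0.5 * y**2`, and `k` plays the role of the checkpoint interval. It keeps only every k-th activation during forward, then rebuilds one block at a time during backward.

```python
import math


def forward_streaming(host_weights, x, k):
    """Forward pass that keeps only the input to every k-th layer."""
    checkpoints = {}
    a = x
    for i, w in enumerate(host_weights):  # one weight "streamed in" at a time
        if i % k == 0:
            checkpoints[i] = a
        a = math.tanh(w * a)
    return a, checkpoints


def backward_blockwise(host_weights, checkpoints, k, grad_out):
    """Backward pass: recompute each block from its checkpoint, then backprop it."""
    n = len(host_weights)
    grads = [0.0] * n
    peak_block_acts = 0
    g = grad_out
    for start in sorted(checkpoints, reverse=True):  # last block first
        end = min(start + k, n)
        acts = [checkpoints[start]]
        for i in range(start, end):                  # recompute this block only
            acts.append(math.tanh(host_weights[i] * acts[-1]))
        peak_block_acts = max(peak_block_acts, len(acts))
        for i in reversed(range(start, end)):
            out = acts[i - start + 1]
            local = g * (1.0 - out * out)            # derivative of tanh
            grads[i] = local * acts[i - start]       # dLoss/dw_i
            g = local * host_weights[i]              # dLoss/d(input of layer i)
    return grads, peak_block_acts


def loss_and_grads(host_weights, x, k):
    y, ckpts = forward_streaming(host_weights, x, k)
    grads, peak = backward_blockwise(host_weights, ckpts, k, grad_out=y)
    return 0.5 * y * y, grads, peak, len(ckpts)


weights = [0.5, -1.2, 0.8, 1.1, -0.7]
_, grads_k2, peak_k2, n_ckpt_k2 = loss_and_grads(weights, 0.3, k=2)
_, grads_full, peak_full, _ = loss_and_grads(weights, 0.3, k=len(weights))
print(peak_k2, n_ckpt_k2, peak_full)
```

With `k=2` the backward pass never holds more than three activations for the block it is working on, plus three stored checkpoints, while `k=5` (store everything) holds six at once; the gradients are identical either way.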

The double-buffered pipeline

This is the optimization that makes it fast instead of merely possible. MegaTrain runs three CUDA streams concurrently (one for compute, one for H2D weight transfer, one for D2H gradient evacuation) and double-buffers the weights so that while the compute stream works on layer $i$ out of one buffer, the next layer's weights prefetch into the other. The double-buffered schedule looks like this:

CUDA stream pipeline (double-buffered schedule)
H2D weights: W0 W1 W2 W3
compute: L0 L1 L2 L3
D2H grads: g0 g1 g2 g3
wall-clock 10 u · GPU busy 80% · 1.60× vs naive

Double-buffered: while the compute stream chews through layer i, the H2D stream prefetches layer i+1 into the other buffer and the D2H stream drains layer i−1's gradients. The compute lane is gap-free — the GPU never waits on PCIe, and removing this one optimization costs MegaTrain 31% of its throughput.
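The schedule is simple enough to simulate. The sketch below models three serial lanes (H2D, compute, D2H) with a two-slot weight buffer and compares it to running every stage back to back. The per-layer costs of 1, 2, and 1 time units are my assumption, chosen because they reproduce the 10 u, 80%, and 1.60× figures shown for four layers; they are not measurements from the paper.

```python
def naive_wall_clock(n_layers, h2d, compute, d2h):
    """Every transfer and every compute runs strictly one after another."""
    return n_layers * (h2d + compute + d2h)


def pipelined_schedule(n_layers, h2d, compute, d2h, n_buffers=2):
    """Return (wall_clock, compute_busy_fraction) for the overlapped schedule."""
    h2d_end, comp_end, d2h_end = [], [], []
    for i in range(n_layers):
        # The H2D lane is serial, and slot i % n_buffers only becomes free
        # once the layer that last used it has finished computing.
        start = h2d_end[i - 1] if i > 0 else 0.0
        if i >= n_buffers:
            start = max(start, comp_end[i - n_buffers])
        h2d_end.append(start + h2d)

        # Compute needs this layer's weights and the previous layer's output.
        c_start = max(h2d_end[i], comp_end[i - 1] if i > 0 else 0.0)
        comp_end.append(c_start + compute)

        # The D2H lane is serial and drains gradients once compute is done.
        g_start = max(comp_end[i], d2h_end[i - 1] if i > 0 else 0.0)
        d2h_end.append(g_start + d2h)

    wall = d2h_end[-1]
    return wall, (n_layers * compute) / wall


naive = naive_wall_clock(4, h2d=1, compute=2, d2h=1)
wall, busy = pipelined_schedule(4, h2d=1, compute=2, d2h=1)
print(naive, wall, busy, naive / wall)  # 16 10.0 0.8 1.6
```

Try setting `h2d` larger than `compute`: the compute lane starts to show gaps, which is the regime where the bus, not the GPU, sets the pace.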

The coordination is three events — weights-ready, backward-done, buffer-free — and the payoff is a compute lane with no gaps: the GPU never stalls on PCIe. The ablation makes the importance unambiguous. Remove double-buffering and throughput drops 31.3% (266 → 183 TFLOPS at 14B) — by far the largest single contributor, more than the gradient slab pool (−3.3%) or tighter checkpointing. Here's the paper's own timeline of the overlap:

The pipelined execution timeline (paper, Figure 3): weight transfer, forward/backward compute, and gradient offload overlap across three CUDA streams in a double-buffered schedule, with the synchronization events that keep the buffers from colliding.
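The buffer handshake can be mimicked on the CPU with two threads and a pair of semaphores per slot. This is an analogy, not MegaTrain's CUDA code: a loader thread stands in for the H2D stream, a compute thread for the compute stream, and only two of the three events (weights-ready and buffer-free) are modeled. All names are hypothetical.

```python
import threading


def run_double_buffered(host_weights, compute_fn, n_buffers=2):
    """Process layers in order while a loader thread prefetches into free slots."""
    buffers = [None] * n_buffers
    weights_ready = [threading.Semaphore(0) for _ in range(n_buffers)]
    buffer_free = [threading.Semaphore(1) for _ in range(n_buffers)]
    outputs = []

    def loader():
        for i, w in enumerate(host_weights):
            slot = i % n_buffers
            buffer_free[slot].acquire()    # wait until compute released this slot
            buffers[slot] = list(w)        # stand-in for the H2D copy
            weights_ready[slot].release()  # signal: weights are in place

    def compute():
        for i in range(len(host_weights)):
            slot = i % n_buffers
            weights_ready[slot].acquire()  # wait for this layer's weights
            outputs.append(compute_fn(buffers[slot]))
            buffers[slot] = None           # release the weights immediately
            buffer_free[slot].release()    # signal: slot can be refilled

    threads = [threading.Thread(target=loader), threading.Thread(target=compute)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outputs


layers = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
print(run_double_buffered(layers, compute_fn=sum))  # [3, 7, 11, 15, 19]
```

The loader can run at most two layers ahead, because it blocks on `buffer_free` until the compute thread hands a slot back. That bounded look-ahead is what keeps the device footprint flat.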

A few more systems details earn their keep:

What it delivers

The headline is a capability, and it's the most convincing part: 120B parameters on one H200, and a 512K-token context on a single GH200. These are regimes where the offload baselines simply OOM — so the comparison is binary, which is the strongest kind.

Throughput vs scale (paper, Figure 1): sustained TFLOPS from 7B to 120B. MegaTrain stays high and stable, where DeepSpeed ZeRO-3 / ZeRO-Infinity / PyTorch degrade and then fall off entirely (they can't fit).

Where the baselines can run but are memory-starved — a 14B model on a PCIe A100 — the margin is large:

14B on a single A100 PCIe, throughput (TFLOPS)
MegaTrain        122
Gemini            15
ZeRO-3 Offload    10

That's 8.1× over Gemini and 12.2× over ZeRO-3 — and on a 48 GB A6000 or a 24 GB RTX 3090, MegaTrain trains 14B at all (56.8 and 30.2 TFLOPS) while ZeRO-3 OOMs. Crucially, accuracy doesn't move — full precision means no drift:

MetaMathQA accuracy
model   MegaTrain   ZeRO-3   ZeRO-Infinity   PyTorch
7B      88.99       88.93    88.97           88.91
14B     92.52       92.41 (single baseline entry; column not recoverable)

On depth, it's the only system that keeps going: ZeRO-3 OOMs by 132 layers and FSDP by 84, while MegaTrain runs the whole range and is 6.14× faster than FSDP at 56 layers. On width, both baselines OOM at 4.0× while MegaTrain alone reaches 5.0×.

What I make of it

The honest summary: MegaTrain redefines the memory ceiling for single-GPU training — provably, at full precision, with public code — and the double-buffered pipeline is a genuinely nice piece of CUDA-stream engineering. Just don't read "1.84× faster" as a general speedup; read it as "it runs, fast enough, where nothing else runs at all."


Built on MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU (Yuan et al., 2026; code). All numbers are from the paper's tables and figures.
