NVIDIA Introduces a 4-Bit Pretraining Methodology Using NVFP4, Validated on a 12B Hybrid Mamba-Transformer at 10T Token Horizon

MarkTechPost / 5/18/2026


Key Points

  • NVIDIA announced a 4-bit pretraining methodology built on its NVFP4 microscaling format, aimed at keeping low-precision training stable and accurate.
  • The approach combines four techniques: keeping selected layers in BF16, applying 16×16 Random Hadamard Transforms to the inputs of the weight-gradient (Wgrad) computation, 2D weight scaling, and stochastic rounding of gradients.
  • NVIDIA validated the method on a 12B hybrid Mamba-Transformer trained over a 10 trillion token horizon, described as the longest publicly documented 4-bit pretraining run.
  • Downstream accuracy closely matched the FP8 baseline (62.58% vs. 62.62% on MMLU-Pro), indicating that 4-bit pretraining retained nearly all of the baseline's quality.
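
To make the stochastic-rounding point concrete, here is a minimal sketch in plain Python. It is illustrative only, not NVIDIA's implementation. It assumes the eight non-negative magnitudes of a 4-bit E2M1 float (0, 0.5, 1, 1.5, 2, 3, 4, 6) as the target grid and leaves out the block scaling factors that a microscaling format applies before rounding. The function name `stochastic_round_fp4` is hypothetical.

```python
import random
from bisect import bisect_right

# Assumed grid: non-negative magnitudes representable in a 4-bit E2M1 float.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def stochastic_round_fp4(x, rng):
    """Round x to one of its two neighbouring grid points.

    The upper neighbour is chosen with probability equal to x's fractional
    position between the two neighbours, so the result is unbiased in
    expectation for values inside the representable range.
    """
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), FP4_GRID[-1])        # saturate at the largest magnitude
    hi_idx = bisect_right(FP4_GRID, mag)   # index of first grid point > mag
    if hi_idx == len(FP4_GRID):            # mag equals the largest grid point
        return sign * FP4_GRID[-1]
    lo, hi = FP4_GRID[hi_idx - 1], FP4_GRID[hi_idx]
    p_up = (mag - lo) / (hi - lo)
    return sign * (hi if rng.random() < p_up else lo)

rng = random.Random(0)
samples = [stochastic_round_fp4(2.5, rng) for _ in range(20000)]
mean_rounded = sum(samples) / len(samples)
print(sorted(set(samples)), mean_rounded)
```

Round-to-nearest would map every 2.5 to the same neighbour and introduce a systematic bias. Here the individual results are 2.0 or 3.0, but their average stays close to 2.5, which is the property that matters when many small gradient contributions are accumulated.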

NVIDIA introduces a 4-bit pretraining methodology built around the NVFP4 microscaling format. It combines selective BF16 layers, 16×16 Random Hadamard Transforms on Wgrad inputs, 2D weight scaling, and stochastic rounding on gradients. The method was validated on a 12B hybrid Mamba-Transformer trained on 10 trillion tokens, the longest publicly documented 4-bit pretraining run, with downstream accuracy closely tracking the FP8 baseline (62.58% vs 62.62% on MMLU-Pro).
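
The role of a 16×16 Random Hadamard Transform can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the method's actual kernels. It assumes the transform is a Sylvester-construction Hadamard matrix scaled by 1/4 and multiplied by a random ±1 diagonal, and that the same matrix is applied to both operands along the dimension being summed over. The helper names (`hadamard`, `random_hadamard`, `rotate`) are hypothetical.

```python
import random

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h

def random_hadamard(n, rng):
    """Orthogonal matrix D @ H / sqrt(n), where D is a random +/-1 diagonal."""
    signs = [rng.choice((-1.0, 1.0)) for _ in range(n)]
    scale = n ** -0.5
    return [[signs[i] * v * scale for v in row]
            for i, row in enumerate(hadamard(n))]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def rotate(vec, q):
    """Compute vec @ q for a length-n vector and an n x n matrix."""
    return [dot(vec, col) for col in zip(*q)]

rng = random.Random(0)
q = random_hadamard(16, rng)

x = [0.1] * 16
x[3] = 8.0                                   # one outlier dominates the block
g = [rng.uniform(-1.0, 1.0) for _ in range(16)]

xr, gr = rotate(x, q), rotate(g, q)
max_before = max(abs(v) for v in x)
max_after = max(abs(v) for v in xr)
print(max_before, max_after)                 # outlier energy is spread out
print(dot(x, g), dot(xr, gr))                # inner product is preserved
```

Because the matrix is orthogonal, rotating both operands leaves their inner product unchanged up to floating-point error, while the single large value is spread across all 16 entries. A flatter block wastes less of the narrow 4-bit range on one outlier, which is the usual motivation for applying such a transform before quantization.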
