Inside NVIDIA's 4-Bit Pretraining Breakthrough: NVFP4 and the 12B Hybrid Mamba-Transformer

NVIDIA has introduced a novel 4-bit pretraining methodology built around their NVFP4 format, validated by training a 12-billion-parameter hybrid Mamba-Transformer on 10 trillion tokens. This marks the longest publicly documented 4-bit training run. The model achieves nearly identical accuracy to its FP8 counterpart (62.58% vs. 62.62% on MMLU-Pro), while benefiting from up to 6× throughput gains on Blackwell GPUs. Below, we answer key questions about this breakthrough.

What is NVFP4 and how does it differ from standard MXFP4?

NVFP4 is NVIDIA's 4-bit microscaling format designed for Blackwell Tensor Cores. Unlike the industry-standard MXFP4, which uses 32-element blocks with each element stored as E2M1 (1 sign, 2 exponent, 1 mantissa) and block scales in UE8M0 (powers of two), NVFP4 introduces three critical changes:

Inside NVIDIA's 4-Bit Pretraining Breakthrough: NVFP4 and the 12B Hybrid Mamba-Transformer — Source: www.marktechpost.com

Smaller blocks: Block size drops from 32 to 16 elements, reducing dynamic range per block.
Richer block scales: Block scale factors use E4M3 format instead of UE8M0, trading exponent range for mantissa precision. This allows the per-block absolute maximum (amax) to map closer to FP4's maximum representable value.
Two-level scaling: An FP32 per-tensor scale remaps values to keep E4M3 block scales in range.

The result: at least 6.25% of values in each block (those at the amax) are represented at near-FP8 precision, while the rest remain in FP4. This design balances compression with accuracy.

Which model components are quantized to NVFP4, and which are not?

Only the GEMMs inside linear (fully-connected) layers—specifically during forward propagation (Fprop), backward weight gradient (Wgrad), and backward data gradient (Dgrad)—run in NVFP4. All other operations remain in higher precision:

Embeddings and the output projection head stay in BF16 or FP32.
Normalization layers, non-linearities, and all attention components (softmax, query-key and attention score-value batched GEMMs) use BF16 or FP32.
Model weights, weight gradients for accumulation across microbatches and data-parallel replicas, and optimizer states are kept in FP32.
Tensor parallel reductions run in BF16.

This selective quantization preserves critical precision where needed while maximizing throughput in the most compute-intensive layers.

What are the four parts of NVIDIA's 4-bit training methodology?

Quantizing every linear-layer GEMM to NVFP4 with default settings (1×16 block scaling everywhere, round-to-nearest-even on every tensor, no transforms) causes early divergence in training. To overcome this, NVIDIA developed a four-part methodology (detailed in their paper):

Adaptive block scaling: Dynamically adjust block scale granularity based on tensor statistics to minimize quantization error.
Mixed-precision residual connections (inferred from context): Keep residual streams in higher precision to preserve gradient flow.
Stochastic rounding: Replace round-to-nearest-even with stochastic rounding during weight updates to reduce bias accumulation.
Gradual quantization schedule: Start training in higher precision and gradually transition to NVFP4 over a few thousand steps.

The exact implementation details are in the arXiv paper (2509.25149), but these innovations collectively enable stable 4-bit training at 10-trillion-token scale.

What accuracy and throughput does the NVFP4-trained model achieve?

The 12B-parameter hybrid Mamba-Transformer trained with NVFP4 achieves 62.58% on MMLU-Pro (5-shot), compared to 62.62% for the FP8 baseline—a negligible 0.04% degradation. In terms of performance, NVFP4 GEMMs run at:

4× BF16 throughput on GB200 GPUs
6× BF16 throughput on GB300 GPUs

This translates to roughly 2× and 3× speedups over FP8, respectively. Additionally, operand memory footprint is approximately halved compared to FP8, reducing memory bandwidth pressure. The model is supported in NVIDIA's Transformer Engine for easy adoption.

Is this the longest 4-bit training run documented?

Yes, according to the NVIDIA research team, this is the longest publicly documented training run in 4-bit precision to date. The 12B-parameter model was trained on 10 trillion tokens, demonstrating that 4-bit pretraining can scale to frontier-level compute budgets without catastrophic divergence or significant accuracy loss. Previous 4-bit efforts were limited to smaller models, shorter token horizons, or post-training quantization. This result validates that 4-bit pretraining is a viable alternative to FP8 for large-scale LLM development.

Which hardware supports NVFP4 and how can I use it?

NVFP4 is natively supported by NVIDIA Blackwell Tensor Cores (GB200 and GB300). The format is integrated into NVIDIA's Transformer Engine, which automatically selects the optimal precision for each layer. Developers can enable NVFP4 by setting the desired precision level in the Transformer Engine configuration—no manual code changes are required for standard linear layers. The companion arXiv paper (2509.25149) provides full algorithmic details and ablation studies. Enterprises with Blackwell GPUs can expect immediate throughput improvements when adopting NVFP4 for both pretraining and fine-tuning workloads.

Tags: