MXNorm: Reusing MXFP block scales for efficient tensor normalisation
arXiv cs.LG / 3/16/2026
Key Points
- MXNorm is a drop-in replacement for RMSNorm that estimates the RMS statistic using only the block scales from MXFP8; since MXFP8 stores one shared scale per 32-element block, the reduction needed for normalization shrinks by 32x.
- The method is validated on pre-training of Llama 3 models (125M, 1B, 8B) with minimal accuracy loss compared to an RMSNorm baseline.
- It achieves kernel speedups of up to 2.4x using only torch.compile, with reported transformer-layer speedups of around 1.3% for Llama 3 8B under MXFP8 and 2.6% under NVFP4.
- As a hardware-conscious optimization that reuses existing MXFP8 scales, MXNorm reduces normalization compute and improves efficiency without requiring major changes to model code.
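The core idea above can be sketched in a few lines: since an MXFP8-style quantizer already computes one shared scale per 32-element block, those scales can stand in for the full elementwise reduction that RMSNorm normally performs. The sketch below is an illustration of that principle, not the paper's exact estimator; the helper names (`block_scales`, `mxnorm_rms_estimate`) and the unit calibration constant are assumptions, and real MXFP8 additionally rounds scales to powers of two, which is omitted here.

```python
import torch

def block_scales(x: torch.Tensor, block: int = 32) -> torch.Tensor:
    # MXFP8-style per-block scales: one shared scale for every block of
    # 32 elements, derived from the block's absolute maximum.
    # (Real MXFP8 rounds this to a power of two; omitted for clarity.)
    blocks = x.reshape(-1, block)
    return blocks.abs().amax(dim=-1)

def mxnorm_rms_estimate(x: torch.Tensor, block: int = 32) -> torch.Tensor:
    # Hypothetical RMS estimate using only the per-block scales:
    # the reduction now runs over n/32 scales instead of n elements.
    # A calibration constant would absorb the systematic max-vs-RMS gap;
    # it is set to 1 here for illustration.
    scales = block_scales(x, block)
    return scales.pow(2).mean().sqrt()

def rmsnorm_exact(x: torch.Tensor) -> torch.Tensor:
    # Reference: the full elementwise RMS reduction used by RMSNorm.
    return x.pow(2).mean().sqrt()
```

Because each block's absolute maximum upper-bounds that block's RMS, the scale-based estimate systematically overestimates the true RMS by a roughly constant factor for well-behaved activations, which is why a fixed calibration constant can make it usable as a drop-in normalizer.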