
[D] MXFP8 GEMM: Up to 99% of cuBLAS performance using CUDA + PTX

Reddit r/MachineLearning / 2026/3/30

💬 Opinion · Developer Stack & Infrastructure · Ideas & Deep Analysis

Key points

  • Daniel Vega-Myhre’s blog post details how to design an FP8 GEMM kernel (“MXFP8 GEMM”) that can reach up to ~99% of cuBLAS performance using CUDA plus PTX.
  • The article deep-dives into the added constraints and implementation challenges specifically introduced by MXFP8, including precision/format handling and kernel design tradeoffs.
  • It offers practical design guidance on hitting performance targets while respecting FP8-related limitations, helping practitioners reproduce high-throughput GEMM behavior on modern NVIDIA GPUs.
  • The post is complemented by related PyTorch/TorchTitan work reporting up to ~41% faster pre-training using MXFP8 (and DeepEP) for DeepSeek-V3 on B200.
  • Overall, the write-up serves as a performance-oriented reference for engineers optimizing GEMM-heavy training/inference pipelines for emerging FP8 formats.
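To make the format-handling constraints mentioned above concrete: in MXFP8 (the OCP microscaling layout), each block of 32 elements shares a single power-of-two scale while the elements themselves are stored as FP8 E4M3. The sketch below is an illustrative NumPy simulation of that quantization scheme, not code from the blog post; the function name, block size handling, and the crude 3-bit mantissa rounding are assumptions for demonstration only (real kernels do this natively on Blackwell-class hardware).

```python
import numpy as np

def quantize_mxfp8_block(x, block=32):
    """Simulate MXFP8 quantization: each block of `block` values shares one
    power-of-two (E8M0-style) scale; elements are clamped/rounded to an
    approximation of FP8 E4M3. Illustrative sketch only."""
    x = np.asarray(x, dtype=np.float32)
    out = np.empty_like(x)
    scales = []
    E4M3_MAX = 448.0  # largest finite E4M3 value
    for i in range(0, len(x), block):
        blk = x[i:i + block]
        amax = float(np.max(np.abs(blk)))
        # choose a power-of-two scale so the block's amax fits in E4M3 range
        exp = 0 if amax == 0.0 else int(np.ceil(np.log2(amax / E4M3_MAX)))
        scale = 2.0 ** exp
        scales.append(scale)
        # clamp to representable range, then round the mantissa to ~3 bits
        # (a rough stand-in for true E4M3 rounding behavior)
        q = np.clip(blk / scale, -E4M3_MAX, E4M3_MAX)
        m, e = np.frexp(q)
        q = np.ldexp(np.round(m * 16) / 16, e)
        out[i:i + block] = q * scale
    return out, np.array(scales)
```

This per-block scaling is what distinguishes MXFP8 from plain tensor-wide FP8: a GEMM kernel must load and apply one extra scale per 32-element block on each operand, which is a key source of the added design constraints the post analyzes.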

New blog post by Daniel Vega-Myhre (Meta/PyTorch) illustrating GEMM design for FP8, including deep-dives into all the constraints and design challenges introduced by MXFP8.

Link: https://danielvegamyhre.github.io/2026/03/29/mxfp8-gemm.html
Original Tweet: https://x.com/vega_myhre/status/2038293614204445039

Additional resources:
MXFP8 and DeepEP for DeepSeek-V3 on B200 w/ TorchTitan: https://pytorch.org/blog/enabling-up-to-41-faster-pre-training-mxfp8-and-deepep-for-deepseek-v3-on-b200-with-torchtitan/

submitted by /u/Benlus
