
[D] MXFP8 GEMM: Up to 99% of cuBLAS performance using CUDA + PTX

Reddit r/MachineLearning / 2026/3/30

💬 Opinion · Developer Stack & Infrastructure · Ideas & Deep Analysis

Key points

  • Daniel Vega-Myhre’s blog post details how to design an FP8 GEMM kernel (“MXFP8 GEMM”) that can reach up to ~99% of cuBLAS performance using CUDA plus PTX.
  • The article deep-dives into the added constraints and implementation challenges specifically introduced by MXFP8, including precision/format handling and kernel design tradeoffs.
  • It offers practical design guidance on hitting performance targets while respecting FP8-related limitations, helping practitioners reproduce high-throughput GEMM behavior on modern NVIDIA GPUs.
  • The post is complemented by related PyTorch/TorchTitan work reporting up to ~41% faster pre-training using MXFP8 (and DeepEP) for DeepSeek-V3 on B200.
  • Overall, the write-up serves as a performance-oriented reference for engineers optimizing GEMM-heavy training/inference pipelines for emerging FP8 formats.
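To make the format-handling constraints mentioned above concrete: in MXFP8 (the OCP microscaling layout), each block of 32 elements shares a single power-of-two scale while the elements themselves are stored as FP8 E4M3. The sketch below is an illustrative NumPy simulation of that quantization scheme, not code from the blog post; the function name, block size handling, and the crude 3-bit mantissa rounding are assumptions for demonstration only (real kernels do this natively on Blackwell-class hardware).

```python
import numpy as np

def quantize_mxfp8_block(x, block=32):
    """Simulate MXFP8 quantization: each block of `block` values shares one
    power-of-two (E8M0-style) scale; elements are clamped/rounded to an
    approximation of FP8 E4M3. Illustrative sketch only."""
    x = np.asarray(x, dtype=np.float32)
    out = np.empty_like(x)
    scales = []
    E4M3_MAX = 448.0  # largest finite E4M3 value
    for i in range(0, len(x), block):
        blk = x[i:i + block]
        amax = float(np.max(np.abs(blk)))
        # choose a power-of-two scale so the block's amax fits in E4M3 range
        exp = 0 if amax == 0.0 else int(np.ceil(np.log2(amax / E4M3_MAX)))
        scale = 2.0 ** exp
        scales.append(scale)
        # clamp to representable range, then round the mantissa to ~3 bits
        # (a rough stand-in for true E4M3 rounding behavior)
        q = np.clip(blk / scale, -E4M3_MAX, E4M3_MAX)
        m, e = np.frexp(q)
        q = np.ldexp(np.round(m * 16) / 16, e)
        out[i:i + block] = q * scale
    return out, np.array(scales)
```

This per-block scaling is what distinguishes MXFP8 from plain tensor-wide FP8: a GEMM kernel must load and apply one extra scale per 32-element block on each operand, which is a key source of the added design constraints the post analyzes.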

New blog post by Daniel Vega-Myhre (Meta/PyTorch) illustrating GEMM design for FP8, including deep-dives into all the constraints and design challenges introduced by MXFP8.

Link: https://danielvegamyhre.github.io/2026/03/29/mxfp8-gemm.html
Original Tweet: https://x.com/vega_myhre/status/2038293614204445039

Additional resources:
MXFP8 and DeepEP for DeepSeek-V3 on B200 w/ TorchTitan: https://pytorch.org/blog/enabling-up-to-41-faster-pre-training-mxfp8-and-deepep-for-deepseek-v3-on-b200-with-torchtitan/

submitted by /u/Benlus
