FeatherOps: Fast fp8 matmul on RDNA3 without native fp8

Reddit r/LocalLLaMA / 3/22/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • FeatherOps demonstrates fast FP8 matrix multiplication on RDNA3 GPUs even without native FP8 support, achieving performance close to the hardware's theoretical maximum.
  • It is currently a proof-of-concept within ComfyUI, with potential applicability to LLM training kernels beyond just inference.
  • The project traces its lineage to the original Feather kernel proposed by u/Venom1806 (SuriyaaMM) and aims for further optimization.
  • GitHub and Reddit links are provided, indicating ongoing community collaboration and iterative development.

https://github.com/woct0rdho/ComfyUI-FeatherOps

I'm working on it in ComfyUI, and the kernel can also be used in LLM training.

Although RDNA3 GPUs do not have native fp8 support, we surprisingly see a speedup with fp8. It's really close to the theoretical peak performance of the hardware, unlike the fp16 matmul in ROCm, which only reaches about half of peak.
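The FeatherOps kernel itself lives in the linked repo; as a rough CPU-side illustration of how fp8 matmul can work at all on hardware without fp8 units (the function names, the e4m3fn format choice, and the lookup-table approach here are my own sketch, not the FeatherOps code), one can decode fp8 bytes to fp16 and feed an ordinary fp16/fp32 matmul:

```python
import numpy as np

def e4m3_to_float(byte):
    """Decode one fp8 e4m3fn byte (1 sign, 4 exponent, 3 mantissa, bias 7)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0xF and man == 0x7:
        return float("nan")  # e4m3fn has no infinities; only this pattern is NaN
    if exp == 0:
        return sign * man * 2.0 ** -9  # subnormal: no implicit leading 1
    return sign * (1 + man / 8.0) * 2.0 ** (exp - 7)

# A 256-entry lookup table turns byte-stored fp8 weights into fp16 with a single
# gather per element -- the kind of cheap in-register conversion a GPU kernel can
# do before handing the data to the fp16 compute units.
LUT = np.array([e4m3_to_float(b) for b in range(256)], dtype=np.float16)

def fp8_matmul(a_bytes, b_bytes):
    """Emulated fp8 matmul: dequantize both operands, accumulate in fp32."""
    a = LUT[a_bytes].astype(np.float32)
    b = LUT[b_bytes].astype(np.float32)
    return a @ b
```

The win on RDNA3 presumably comes from fp8 halving the memory traffic while the conversion overlaps with the fp16 math, but that is a guess about the kernel's design, not something stated in the post.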

For now it's a proof of concept rather than a great speedup in ComfyUI. It's been a long journey since the original Feather kernel was proposed by u/Venom1806 (SuriyaaMM), and let's see how far it can be optimized.

submitted by /u/woct0rdho