https://github.com/woct0rdho/ComfyUI-FeatherOps
I'm developing it for ComfyUI, and the same kernel can also be used for LLM training.
Although RDNA3 GPUs do not have native fp8, surprisingly we still see a speedup with fp8. The kernel gets really close to the theoretical max performance of the hardware, unlike the fp16 matmul in ROCm, which only reaches half of it.
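To make the "fp8 without native fp8" idea concrete, here is a minimal pure-Python sketch. It assumes the usual approach for hardware without fp8 math: values are stored as fp8 bytes and upcast to a wider float right before the multiply. The post does not spell out the mechanism, so treat that as an assumption. The E4M3FN layout (1 sign bit, 4 exponent bits, 3 mantissa bits, bias 7) is the standard one; the function names `decode_e4m3fn` and `matmul_fp8_weights` are made up for illustration and are not part of FeatherOps.

```python
import math


def decode_e4m3fn(byte: int) -> float:
    """Decode one fp8 E4M3FN byte (1 sign, 4 exponent, 3 mantissa bits, bias 7)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0x0F
    mant = byte & 0x07
    if exp == 0x0F and mant == 0x07:
        return math.nan  # E4M3FN has no infinities; this bit pattern is NaN
    if exp == 0:
        return sign * (mant / 8.0) * 2.0 ** -6  # subnormal range
    return sign * (1.0 + mant / 8.0) * 2.0 ** (exp - 7)


def matmul_fp8_weights(a, b_fp8):
    """Toy matmul: `a` is a list of float rows, `b_fp8` is a list of rows of fp8 bytes.

    Hypothetical helper: the weights stay in fp8 in memory and are upcast
    just before the multiply-accumulate, which is the assumed trick here.
    """
    b = [[decode_e4m3fn(x) for x in row] for row in b_fp8]  # upcast step
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]
```

In a real GPU kernel the upcast would happen in registers after loading the fp8 tile, so the matrix costs half the memory and bandwidth of fp16 while the math still runs on the fp16 units. This sketch only shows the number format and the order of operations, not the performance side.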
For now it's more of a proof of concept than a big speedup in ComfyUI. It's been a long journey since the original Feather kernel was proposed by u/Venom1806 (SuriyaaMM). Let's see how far it can be optimized from here.