github.com/MoonshotAI/FlashKDA

Been comparing how different routing layers handle K2.6 this week (OpenRouter, Together, Orq), and while digging around I came across FlashKDA, which Moonshot dropped alongside the K2.6 activity. It seems to be flying under the radar, so sharing here because the kernel work is genuinely interesting on its own, separate from the model release.

What it is: a CUTLASS C++ implementation of the forward kernel for Kimi Delta Attention, the linear attention variant from the Kimi Linear paper. It plugs into flash-linear-attention as a backend through FLA pull request #852, so anyone already using FLA for KDA-based models can route through FlashKDA at the backend layer.

Numbers from their H20 benchmark, measured against FLA's existing Triton path at T=8192, H=96, D=128: fixed-length sequences, 1.72x; variable length with mixed seq_lens, 1.95x; variable length with uniform 1024x8, 2.22x.

Why this matters: linear attention architectures like KDA promise linear scaling with sequence length, but the promise only holds if the kernel implementation is actually hardware efficient. FLA's Triton path is the reference and it works, but CUTLASS tuned for Hopper memory-access patterns is how you close the gap between the theoretical cost model and what you see on a real GPU.

Requirements are SM90 and above, CUDA 12.9 and above, PyTorch 2.4 and above. MIT licensed.

One honest limitation worth flagging: the benchmark is forward pass only, and all numbers are on H20. H20 is the China-specific Hopper variant, so absolute numbers on H100 or Blackwell will differ. The relative speedup should be directionally similar, but nobody has posted those numbers yet.

Curious whether anyone on here has tested it on H100, or has thoughts on when a backward-pass kernel might land. The forward-only story limits the training use case right now.
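For anyone unfamiliar with what a "delta attention" kernel is actually computing: here's a naive NumPy sketch of the generic delta-rule linear-attention recurrence that KDA-style layers build on. This is my own illustrative reference, not FlashKDA's algorithm (the real kernel is chunked CUTLASS code, and KDA adds its own gating details); the function name and shapes are mine.

```python
import numpy as np

def delta_rule_attention(q, k, v, beta):
    """Naive O(T * d^2) reference for a delta-rule linear attention
    recurrence (the family KDA belongs to). Illustrative sketch only,
    not FlashKDA's actual chunked algorithm.

    q, k: (T, d_k); v: (T, d_v); beta: (T,) write gates in [0, 1].
    """
    T, d_k = q.shape
    d_v = v.shape[1]
    S = np.zeros((d_k, d_v))       # recurrent state, fixed size in T
    out = np.empty((T, d_v))
    for t in range(T):
        # delta rule: overwrite what the state currently stores under
        # key k_t with v_t, scaled by the gate beta_t
        pred = S.T @ k[t]                          # state's current read-out for k_t
        S = S + np.outer(k[t], beta[t] * (v[t] - pred))
        out[t] = S.T @ q[t]                        # query the updated state
    return out
```

The point of the sketch is the shape of the work: the state `S` is fixed-size regardless of sequence length, which is where the linear-in-T scaling comes from, and also why a fast kernel (chunked matmuls instead of this per-token loop) is what makes the architecture usable in practice.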
Moonshot open-sourced FlashKDA, CUTLASS kernels for Kimi Delta Attention, up to 2.22x over the Triton baseline on H20
Reddit r/LocalLLaMA / 4/22/2026
💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research
Key Points
- MoonshotAI has open-sourced FlashKDA, a CUTLASS (C++) forward-kernel implementation for Kimi Delta Attention (KDA), a linear attention variant from the Kimi Linear paper.
- FlashKDA integrates with the Flash Linear Attention (FLA) project as a backend via FLA pull request #852, enabling existing FLA-based KDA models to use it transparently.
- On NVIDIA H20 (SM90+), benchmarks against FLA’s existing Triton path show a 1.72x speedup for fixed-length sequences, 1.95x for mixed variable-length sequences, and 2.22x for a uniform variable-length setting (1024x8).
- The article emphasizes that linear-attention scaling benefits depend on truly hardware-efficient kernels, and CUTLASS tuned for Hopper memory-access patterns helps close the gap between theory and real GPU performance.
- FlashKDA currently supports forward pass only and is licensed under MIT; requirements include CUDA 12.9+, PyTorch 2.4+, and SM90+ hardware, so training/backward use cases remain limited for now.
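The scaling claim in the key points can be made concrete with a back-of-the-envelope FLOP model. This is my own sketch, not anything from the repo, and the constants are deliberately rough leading-order counts; it just contrasts softmax attention's quadratic cost with a fixed-size-state linear recurrence.

```python
def attn_flops(T, d):
    """Rough leading-order FLOP counts per head (illustrative cost
    model only, not a FlashKDA benchmark). Softmax attention pays for
    two (T x T)-shaped matmuls; a linear-attention recurrence pays for
    a d x d state update and read-out per token.
    """
    softmax = 4 * T * T * d   # Q @ K^T plus attn @ V, ~2T^2d each
    linear = 4 * T * d * d    # per-token state update + query read-out
    return softmax, linear

# The ratio grows linearly in T: at the benchmarked T=8192, d=128,
# the quadratic path does 64x the work of the linear recurrence in
# this crude model.
```

Which is exactly why the post's caveat matters: a 64x theoretical advantage is easy to squander on a memory-bound kernel, so implementation quality (Triton vs. tuned CUTLASS) decides how much of it you actually see.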



