ggml-cuda: add flash-attn support for DKQ=320/DV=256 with ncols2=32 (… by lnigam · Pull Request #22286 · ggml-org/llama.cpp

Reddit r/LocalLLaMA / 4/29/2026

📰 News · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • The llama.cpp CUDA backend (ggml-cuda) has added FlashAttention support for a specific configuration: DKQ=320/DV=256 with ncols2=32.
  • The update is reported to improve CUDA performance for Mistral Small 4, where a CPU fallback previously reduced speed.
  • By enabling an optimized kernel path on the GPU, the change likely lowers latency and increases throughput compared to the earlier fallback behavior (see the sketch after this list).
  • The discussion notes speculation about whether the improvement could relate to upcoming Mistral releases, though no definitive connection is stated.
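As background on why a single shape needs its own kernel support: GPU FlashAttention kernels in ggml-cuda are compiled as specializations for particular head-size combinations, so a model whose attention heads use an unsupported shape gets routed to a slower fallback path. The snippet below is a minimal, hypothetical C++ sketch of that dispatch pattern; the function names and the list of supported shapes are illustrative assumptions, not the actual llama.cpp code.

```cpp
// Hypothetical illustration -- not the actual llama.cpp implementation.
// Models how a backend can instantiate attention kernels for a fixed set of
// (DKQ, DV) head-size pairs at compile time and fall back to a generic
// (e.g. CPU) path when a shape has no specialized kernel.
#include <cstdio>

// Compile-time specialized "kernel" for one (DKQ, DV) pair.
template <int DKQ, int DV>
void flash_attn_kernel() {
    std::printf("running specialized kernel: DKQ=%d DV=%d\n", DKQ, DV);
}

// Generic fallback used when no specialization matches.
void attention_fallback(int dkq, int dv) {
    std::printf("no specialized kernel for DKQ=%d DV=%d, using fallback\n", dkq, dv);
}

// Runtime dispatch over the set of supported head-size pairs.
// Supporting a new shape (such as 320/256) means adding one case here,
// which instantiates the corresponding specialized kernel.
void dispatch_flash_attn(int dkq, int dv) {
    if (dkq == 128 && dv == 128) { flash_attn_kernel<128, 128>(); return; }
    if (dkq == 192 && dv == 128) { flash_attn_kernel<192, 128>(); return; }
    if (dkq == 256 && dv == 256) { flash_attn_kernel<256, 256>(); return; }
    if (dkq == 320 && dv == 256) { flash_attn_kernel<320, 256>(); return; } // newly added pair
    attention_fallback(dkq, dv);
}

int main() {
    dispatch_flash_attn(128, 128); // already supported
    dispatch_flash_attn(320, 256); // previously hit the fallback, now specialized
    return 0;
}
```

In this simplified model, supporting DKQ=320/DV=256 amounts to adding one more specialized case so that models with that head geometry no longer take the slower fallback path.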

Improves the speed of Mistral Small 4 on CUDA

(there was a CPU fallback before)

(I wonder if it’s somehow related to the upcoming Mistral model? Maybe not)

submitted by /u/jacek2023