Improves the speed of Mistral Small 4 on CUDA (there was a CPU fallback before). (I wonder if it’s somehow related to the upcoming Mistral model? Maybe not.)
ggml-cuda: add flash-attn support for DKQ=320/DV=256 with ncols2=32 (… by lnigam · Pull Request #22286 · ggml-org/llama.cpp
Reddit r/LocalLLaMA / 4/29/2026
📰 News · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research
Key Points
- llama.cpp's ggml-cuda backend adds a FlashAttention kernel for the head-size combination DKQ=320/DV=256 with ncols2=32.
- The update is reported to improve CUDA performance for Mistral Small 4, where a CPU fallback previously reduced speed.
- By keeping attention on the optimized GPU kernel path instead of the earlier fallback, the change likely lowers latency and increases throughput (see the sketch after this list).
- The discussion notes speculation about whether the improvement could relate to upcoming Mistral releases, though no definitive connection is stated.
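To make the fallback mechanics concrete, here is a minimal, hypothetical C++ sketch. This is not llama.cpp's actual API; the support table, function names, and shapes are assumptions for illustration. It shows the general pattern of gating a fused FlashAttention kernel on a whitelist of head-size combinations and routing unsupported shapes to a slower fallback:

```cpp
// Hypothetical sketch (not llama.cpp's real dispatch code): a CUDA backend
// that only has fused FlashAttention kernels for certain (DKQ, DV) head-size
// pairs and falls back to a slower path for everything else.
#include <cstdio>

struct HeadDims {
    int dkq; // head dimension of K and Q
    int dv;  // head dimension of V
};

// Assumed table of shapes with a dedicated FlashAttention kernel.
// The PR adds the {320, 256} entry that Mistral Small 4 needs.
static const HeadDims kFlashAttnSupported[] = {
    {64, 64}, {128, 128}, {256, 256},
    {320, 256}, // newly supported per PR #22286
};

static bool flash_attn_supported(int dkq, int dv) {
    for (const HeadDims &d : kFlashAttnSupported)
        if (d.dkq == dkq && d.dv == dv) return true;
    return false;
}

static void run_attention(int dkq, int dv) {
    if (flash_attn_supported(dkq, dv)) {
        std::printf("DKQ=%d/DV=%d: fused FlashAttention kernel on GPU\n", dkq, dv);
    } else {
        // Before the PR, the 320/256 shape landed here: a fallback path
        // (e.g. on the CPU) that costs extra transfers and kernel launches.
        std::printf("DKQ=%d/DV=%d: fallback path (slower)\n", dkq, dv);
    }
}

int main() {
    run_attention(128, 128); // long-supported shape
    run_attention(320, 256); // Mistral Small 4 shape, now on the fast path
    return 0;
}
```

The design point is that a whitelist lookup at dispatch time is cheap, so adding one kernel specialization for a new head-size pair is enough to move an entire model family off the slow path.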
Related Articles

Black Hat USA
AI Business

LLMs will be a commodity
Reddit r/artificial

Indian Developers: How to Build AI Side Income with $0 Capital in 2026
Dev.to

HubSpot Just Legitimized AEO: What It Means for Your Brand AI Visibility
Dev.to

What it feels like to have Qwen 3.6 or Gemma 4 running locally
Reddit r/LocalLLaMA