CUDA: reduce MMQ stream-k overhead by JohannesGaessler · Pull Request #22298 · ggml-org/llama.cpp

Reddit r/LocalLLaMA / 4/25/2026

📰 News · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • A pull request to the ggml-org/llama.cpp repository proposes CUDA changes that reduce the overhead of stream-k work partitioning in the MMQ (quantized matrix multiplication) kernels during prompt processing.
  • The update is aimed at improving prompt-processing speed specifically in Mixture-of-Experts (MoE) scenarios.
  • The post points readers to an associated GitHub issue comment for additional details on the proposed performance improvement.
  • The work is part of ongoing optimization efforts for running LLMs efficiently on NVIDIA GPUs using CUDA.
  • The expected outcome is lower kernel-launch and scheduling overhead and faster prompt throughput for CUDA-based deployments of llama.cpp, particularly with MoE models.
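For context on what "stream-k overhead" refers to, here is a minimal illustrative sketch of the general stream-k decomposition idea for GPU matrix multiplication, not the PR's actual code: instead of assigning whole output tiles to thread blocks (CTAs), stream-k splits the flattened iteration space evenly across a fixed number of CTAs, which balances work but forces a "fix-up" reduction wherever a tile is split across two CTAs. The function and parameter names below are hypothetical and chosen for illustration only.

```python
# Illustrative sketch of stream-k partitioning (not llama.cpp code):
# work is divided evenly over CTAs by iteration count, so load balance
# holds even when the tile count does not divide the CTA count.
def streamk_partition(num_tiles, iters_per_tile, num_ctas):
    """Return per-CTA (start, end) ranges over the global iteration space."""
    total = num_tiles * iters_per_tile
    base, rem = divmod(total, num_ctas)
    ranges, start = [], 0
    for cta in range(num_ctas):
        length = base + (1 if cta < rem else 0)
        ranges.append((start, start + length))
        start += length
    return ranges

def count_fixups(ranges, iters_per_tile):
    """Count CTA boundaries that fall inside a tile: each one costs an
    extra partial-result write plus a reduction -- the stream-k overhead
    this PR aims to lower."""
    return sum(1 for start, _ in ranges[1:] if start % iters_per_tile != 0)

# 10 tiles x 8 iterations = 80 iterations split across 4 CTAs (20 each):
ranges = streamk_partition(num_tiles=10, iters_per_tile=8, num_ctas=4)
print(ranges)  # [(0, 20), (20, 40), (40, 60), (60, 80)]
# Boundaries at 20 and 60 fall mid-tile, so two fix-up reductions are needed:
print(count_fixups(ranges, iters_per_tile=8))  # 2
```

The fix-up reductions (and the partial-result buffers behind them) are pure bookkeeping relative to a plain tile-per-CTA launch, which is why reducing that overhead translates directly into faster prompt processing.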