I wrote a fused MoE dispatch kernel in pure Triton that beats Megablocks on Mixtral and DeepSeek at inference batch sizes

Reddit r/LocalLLaMA / 4/6/2026

💬 Opinion · Developer Stack & Infrastructure · Ideas & Deep Analysis · Tools & Practical Usage

Key Points

  • The post presents a custom, fused Mixture-of-Experts (MoE) inference dispatch pipeline implemented purely in Triton that reduces the MoE forward pass to 5 kernel launches versus 24+ in a naive implementation.
  • Benchmarks on Mixtral-8x7B (A100) show large speedups over PyTorch at common serving batch/token sizes (e.g., ~4.9–6.5x faster), and at 32–128 tokens it outperforms Megablocks, though Megablocks regains the lead at 512+ tokens due to its optimized block-sparse matmul.
  • A central optimization is fusing the gate and up projection GEMMs so they reuse the same L2-resident input tiles, while performing SiLU in registers to avoid global-memory round trips, saving roughly 470MB of memory traffic per forward pass on Mixtral.
  • The author reports additional validation on DeepSeek-V3 (256 experts) and Qwen2-MoE, and demonstrates portability to AMD MI300X with zero code changes and all 162 tests passing.
  • The code and a detailed writeup (including roofline analysis) are published in a GitHub repository and blog post, aiming to make high-performance MoE inference kernels more accessible to practitioners.

Been working on custom Triton kernels for LLM inference for a while. My latest project: a fused MoE dispatch pipeline that handles the full forward pass in 5 kernel launches instead of 24+ in the naive approach.
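To make the launch-count arithmetic concrete, here's a minimal sketch of the naive baseline (my reconstruction, not the repo's code, with hypothetical names like `naive_moe_forward`): Mixtral's 8 experts × 3 GEMMs each (gate, up, down) already gives 24 GEMM launches, before counting the router, activation, and scatter kernels.

```python
import torch
import torch.nn.functional as F

def naive_moe_forward(x, router_w, experts):
    """x: (tokens, hidden); router_w: (hidden, n_experts);
    experts: list of (w_gate, w_up, w_down) weight tuples."""
    logits = x @ router_w                                  # 1 router GEMM
    weights, idx = torch.topk(F.softmax(logits, dim=-1), k=2)
    weights = weights / weights.sum(-1, keepdim=True)      # renormalize top-2
    out = torch.zeros_like(x)
    for e, (w_gate, w_up, w_down) in enumerate(experts):   # 8 iterations on Mixtral
        rows, slot = (idx == e).nonzero(as_tuple=True)     # tokens routed to expert e
        if rows.numel() == 0:
            continue
        h = x[rows]
        gate = h @ w_gate                                  # GEMM launch
        up = h @ w_up                                      # GEMM launch
        act = F.silu(gate) * up                            # extra elementwise launches
        down = act @ w_down                                # GEMM launch
        out.index_add_(0, rows, down * weights[rows, slot].unsqueeze(-1))
    return out
```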

Results on Mixtral-8x7B (A100):

| Tokens | Speedup vs PyTorch | Throughput vs Megablocks |
|-------:|-------------------:|-------------------------:|
| 32     | 4.9x               | 131%                     |
| 128    | 5.8x               | 124%                     |
| 512    | 6.5x               | 89%                      |

At 32 and 128 tokens (where most inference serving actually happens), it's faster than Stanford's CUDA-optimized Megablocks. At 512+ tokens, Megablocks pulls ahead with its hand-tuned block-sparse matmul.
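If you want to reproduce this shape of comparison yourself, a sketch along these lines should work (hedged: `fused_moe_forward` is a stand-in for the repo's fused pipeline, `naive_moe_forward` is the loop sketched above, and the repo's actual harness may differ):

```python
import torch
import triton

hidden, ffn, n_experts = 4096, 14336, 8  # Mixtral-8x7B dims
router_w = torch.randn(hidden, n_experts, device="cuda", dtype=torch.float16)
experts = [tuple(torch.randn(*s, device="cuda", dtype=torch.float16)
                 for s in ((hidden, ffn), (hidden, ffn), (ffn, hidden)))
           for _ in range(n_experts)]

for tokens in (32, 128, 512):
    x = torch.randn(tokens, hidden, device="cuda", dtype=torch.float16)
    # do_bench handles warmup and returns a median time in ms
    ms_naive = triton.testing.do_bench(lambda: naive_moe_forward(x, router_w, experts))
    ms_fused = triton.testing.do_bench(lambda: fused_moe_forward(x, router_w, experts))
    print(f"{tokens:>4} tokens: {ms_naive / ms_fused:.1f}x over the naive loop")
```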

The key trick is fusing the gate+up projection so both GEMMs share the same input tile from L2 cache, and the SiLU activation happens in registers without ever hitting global memory. Saves ~470MB of memory traffic per forward pass on Mixtral.
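For flavor, here's a minimal single-expert sketch of that fusion in Triton (my illustration of the technique, not the repo's multi-expert dispatch kernel; `fused_gate_up` is a hypothetical name): each K-step loads one tile of `x` and feeds it to both accumulators, and SiLU plus the elementwise product run on the fp32 accumulators in registers, so the raw gate/up activations are never materialized in global memory.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def fused_gate_up_kernel(x_ptr, wg_ptr, wu_ptr, out_ptr, M, N, K,
                         stride_xm, stride_xk, stride_wk, stride_wn,
                         stride_om, stride_on,
                         BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                         BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    rk = tl.arange(0, BLOCK_K)
    x_ptrs = x_ptr + rm[:, None] * stride_xm + rk[None, :] * stride_xk
    wg_ptrs = wg_ptr + rk[:, None] * stride_wk + rn[None, :] * stride_wn
    wu_ptrs = wu_ptr + rk[:, None] * stride_wk + rn[None, :] * stride_wn
    acc_g = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    acc_u = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        # One tile of x serves BOTH GEMMs: loaded once, stays hot.
        x = tl.load(x_ptrs, mask=(rm[:, None] < M) & (rk[None, :] + k < K), other=0.0)
        wg = tl.load(wg_ptrs, mask=(rk[:, None] + k < K) & (rn[None, :] < N), other=0.0)
        wu = tl.load(wu_ptrs, mask=(rk[:, None] + k < K) & (rn[None, :] < N), other=0.0)
        acc_g += tl.dot(x, wg)
        acc_u += tl.dot(x, wu)
        x_ptrs += BLOCK_K * stride_xk
        wg_ptrs += BLOCK_K * stride_wk
        wu_ptrs += BLOCK_K * stride_wk
    # SiLU + elementwise product on the fp32 accumulators, entirely in
    # registers: the gate/up intermediates never touch global memory.
    out = acc_g * tl.sigmoid(acc_g) * acc_u
    out_ptrs = out_ptr + rm[:, None] * stride_om + rn[None, :] * stride_on
    tl.store(out_ptrs, out.to(tl.float16), mask=(rm[:, None] < M) & (rn[None, :] < N))

def fused_gate_up(x, w_gate, w_up):
    # Assumes w_gate and w_up are contiguous with identical (K, N) layout.
    M, K = x.shape
    _, N = w_gate.shape
    out = torch.empty((M, N), device=x.device, dtype=torch.float16)
    grid = (triton.cdiv(M, 64), triton.cdiv(N, 64))
    fused_gate_up_kernel[grid](x, w_gate, w_up, out, M, N, K,
                               x.stride(0), x.stride(1),
                               w_gate.stride(0), w_gate.stride(1),
                               out.stride(0), out.stride(1),
                               BLOCK_M=64, BLOCK_N=64, BLOCK_K=32)
    return out
```

Rough arithmetic on why the traffic number is plausible: avoiding one write plus one read of each fp16 intermediate is 4 × 14336 × 2 B ≈ 112 KB per expert-token; across Mixtral's 32 layers with top-2 routing, that reaches ~470MB at serving-size batches, though the author's exact accounting may differ.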

Also tested on DeepSeek-V3 (256 experts) and Qwen2-MoE. Ran the full suite on AMD MI300X with zero code changes, all 162 tests passing.
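The portability is mostly Triton's and PyTorch's doing: ROCm builds expose the GPU through the same "cuda" device string, and Triton JIT-compiles for whichever backend it finds. A hedged sketch of what one correctness check in such a suite might look like (the real 162 tests live in the repo; `fused_gate_up` is the sketch kernel from above):

```python
import torch

def test_fused_gate_up_matches_reference(tokens=64, hidden=512, ffn=1024):
    # "cuda" maps to HIP on ROCm builds of PyTorch, so this runs
    # unchanged on MI300X; Triton compiles for the GPU it finds.
    dev = "cuda"
    x = torch.randn(tokens, hidden, device=dev, dtype=torch.float16)
    wg = torch.randn(hidden, ffn, device=dev, dtype=torch.float16)
    wu = torch.randn(hidden, ffn, device=dev, dtype=torch.float16)
    # fp32 reference of silu(x @ wg) * (x @ wu)
    ref = torch.nn.functional.silu(x.float() @ wg.float()) * (x.float() @ wu.float())
    out = fused_gate_up(x, wg, wu)
    # loose tolerances to absorb fp16 storage rounding
    torch.testing.assert_close(out.float(), ref, rtol=2e-2, atol=2e-2)
```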

Code: https://github.com/bassrehab/triton-kernels

Full writeup with roofline analysis: https://subhadipmitra.com/blog/2026/fused-moe-dispatch-triton/

submitted by /u/bassrehab