[P] Fused MoE Dispatch in Pure Triton: Beating CUDA-Optimized Megablocks at Inference Batch Sizes

Reddit r/MachineLearning / 4/6/2026


Key Points

  • The article describes a fused Mixture-of-Experts (MoE) forward-pass dispatch kernel implemented entirely in Triton (without CUDA or vendor-specific code).
  • On Mixtral-8x7B running on an A100, the Triton approach outperforms Stanford's Megablocks at inference-relevant batch sizes, reaching 131% of Megablocks' throughput at a batch of 32 tokens and 124% at 128 tokens.
  • It introduces a fused gate-plus-up projection that reuses input tile loads and computes SiLU in registers, reducing intermediate buffer usage and cutting memory traffic by about 35% (approximately 470MB per forward pass).
  • It also uses a block-scheduled grouped GEMM with a precomputed mapping from block_id to (expert_id, offset) to handle variable-sized expert batches in a single kernel launch without padding.
  • The method reportedly passes full tests across multiple MoE models (Mixtral-8x7B, DeepSeek-V3 with 256 experts, Qwen2-MoE) and works on AMD MI300X with no code changes.

I built a fused MoE dispatch kernel in pure Triton that handles the full forward pass for Mixture-of-Experts models. No CUDA, no vendor-specific code.

On Mixtral-8x7B (A100), it beats Stanford's Megablocks at inference-relevant batch sizes (131% of Megablocks' throughput at 32 tokens, 124% at 128 tokens). At larger batches Megablocks' hand-tuned CUDA pulls ahead, as expected.

Two main contributions:

  1. Fused gate+up projection - both GEMMs reuse the same input tile loads, and SiLU is computed in registers. This eliminates ~470MB of intermediate buffers per forward pass (a ~35% reduction in memory traffic).
  2. Block-scheduled grouped GEMM - a precomputed block_id to (expert_id, offset) mapping handles variable-sized expert batches in a single kernel launch without padding.
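The post doesn't reproduce the kernel, but the math being fused in contribution 1 is the standard SwiGLU MLP: silu(x @ W_gate) * (x @ W_up). A minimal NumPy reference of what the fused kernel computes (function names here are illustrative, not from the repo) makes it clear why fusion saves traffic: the unfused version materializes the gate and up GEMM outputs in HBM before the elementwise step, while the fused kernel keeps them in registers.

```python
import numpy as np

def silu(x):
    # SiLU / swish: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def unfused_gate_up(x, w_gate, w_up):
    # Baseline: two GEMMs whose outputs are materialized as
    # intermediate buffers before the elementwise combine.
    gate = x @ w_gate   # intermediate buffer 1
    up = x @ w_up       # intermediate buffer 2
    return silu(gate) * up

def fused_gate_up_ref(x, w_gate, w_up):
    # Reference for the fused kernel's output: same result, but in
    # the Triton version both GEMMs share each input tile of x and
    # SiLU is applied in registers, so gate/up never hit HBM.
    return silu(x @ w_gate) * (x @ w_up)
```

A correctness check against the unfused path is the natural test here; the fused Triton kernel is bit-for-bit a rearrangement of the same arithmetic (modulo float accumulation order).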
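For contribution 2, the idea is that after routing, each expert owns a variable-length contiguous slice of tokens; instead of padding every expert to the largest slice, you precompute, on the host, one (expert_id, row_offset) entry per tile of work and launch a single 1D grid over those entries. A sketch of that precompute (helper name is my own, not from the repo):

```python
def build_block_map(tokens_per_expert, block_m):
    """Map each launch block to the expert and row offset it should
    process. Experts with zero routed tokens get no blocks, so a
    single kernel launch covers all experts without padding."""
    mapping = []
    for expert_id, n_tokens in enumerate(tokens_per_expert):
        # One block per BLOCK_M-row tile of this expert's token slice.
        for offset in range(0, n_tokens, block_m):
            mapping.append((expert_id, offset))
    return mapping
```

Inside the kernel, each program instance would index this table with its program_id to find which expert's weights and which token rows to load, masking the ragged tail tile as usual.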

Tested across Mixtral-8x7B, DeepSeek-V3 (256 experts), and Qwen2-MoE. Full test suite passes on AMD MI300X with zero code changes.

Code: https://github.com/bassrehab/triton-kernels

Writeup: https://subhadipmitra.com/blog/2026/fused-moe-dispatch-triton/

submitted by /u/bassrehab