RaMP: Runtime-Aware Megakernel Polymorphism for Mixture-of-Experts

arXiv cs.AI / 4/30/2026

💬 Opinion · Developer Stack & Infrastructure · Models & Research

Key Points

  • The paper argues that Mixture-of-Experts (MoE) inference performance depends not only on batch size but also on the expert routing distribution, and that existing production dispatch policies miss 10–70% of potential kernel throughput.
  • It proposes RaMP, a routing-aware dispatch framework that uses a performance-region analysis (based on hardware constants) to determine when different optimizations matter across multiple architectures.
  • RaMP includes a four-parameter “wave cost” model (sketched after this list) that selects near-optimal kernel configurations from the runtime expert histogram, achieving 0.93% mean regret versus exhaustive search after only 10–24 minutes of one-time profiling per model.
  • The approach is kernel-agnostic: driven only by CTA grid geometry, it can be applied to Alpha-MoE without source changes and also benefits from a co-designed CuTe DSL kernel with 134–268 polymorphic configurations.
  • Reported speedups include a 1.22× kernel speedup over static dispatch and a 1.30× end-to-end improvement in vLLM serving over Triton, plus further end-to-end gains over the DeepGEMM (1.41×) and FlashInfer CUTLASS (1.13×) backends.
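The summary does not include the paper's cost-model code, but the wave-based idea from the third key point can be sketched. In the sketch below, the parameter names (alpha, beta, gamma, delta), the candidate-configuration format, the SM count, and the helper names are illustrative assumptions rather than RaMP's actual implementation; the only ingredients taken from the source are that cost is computed from CTA grid geometry and the runtime expert histogram, and that four fitted parameters drive the selection.

```python
import math

# Schematic wave-cost selector. All names and constants below are assumptions
# for illustration; only the idea (cost from CTA grid geometry plus the runtime
# expert histogram, scored by four fitted parameters) comes from the paper.

NUM_SMS = 132  # CTAs that run concurrently in one "wave"; query the device in practice


def cta_count(tokens, tile_m, tile_n, n_cols):
    """CTAs one expert's GEMM launches under a (tile_m, tile_n) configuration."""
    return math.ceil(tokens / tile_m) * math.ceil(n_cols / tile_n)


def wave_cost(histogram, config, params, n_cols):
    """Hypothetical four-parameter cost: full waves, a partial wave, total CTAs, a constant."""
    alpha, beta, gamma, delta = params  # assumed to be fitted during one-time profiling
    tile_m, tile_n = config
    total_ctas = sum(cta_count(t, tile_m, tile_n, n_cols) for t in histogram if t > 0)
    full_waves, remainder = divmod(total_ctas, NUM_SMS)
    return alpha * full_waves + beta * (1 if remainder else 0) + gamma * total_ctas + delta


def select_config(histogram, candidate_configs, params, n_cols):
    """Pick the cheapest candidate configuration for the observed routing."""
    return min(candidate_configs, key=lambda cfg: wave_cost(histogram, cfg, params, n_cols))
```

A dispatcher in this style would call select_config once per MoE layer with the histogram produced by the router. Because the cost is a closed-form function of grid geometry, the same selector can sit in front of any kernel that exposes multiple configurations, which is what makes the approach kernel-agnostic.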

Abstract

The optimal kernel configuration for Mixture-of-Experts (MoE) inference depends on both batch size and the expert routing distribution, yet production systems dispatch from batch size alone, leaving 10–70% of kernel throughput unrealized. We present RaMP, a routing-aware dispatch framework. A performance-region analysis derives, from hardware constants alone, when each optimization helps, correctly predicting behavior on all 8 tested architectures, including 3 unseen ones. A four-parameter wave cost model selects the fastest configuration from the runtime expert histogram, achieving 0.93% mean regret versus exhaustive search while being fitted from just 10–24 minutes of one-time profiling per model. Because the model depends only on CTA grid geometry, it is kernel-agnostic: applied to Alpha-MoE, it delivers a 1.14× speedup with no source modification. Paired with a co-designed CuTe DSL kernel exposing 134–268 polymorphic configurations, RaMP delivers a 1.22× kernel speedup over static dispatch and a 1.30× end-to-end speedup in vLLM serving over Triton, 1.41× over DeepGEMM, and 1.13× over FlashInfer CUTLASS.
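The abstract's performance-region analysis is derived from hardware constants alone, but the specific regions and optimizations are not spelled out in this summary. Purely as an illustration of that style of reasoning, with assumed tile sizes, SM counts, and a hypothetical question, one such region could be the per-expert token count below which an expert's GEMM no longer fills a single wave of SMs, where occupancy-oriented configurations tend to matter most:

```python
import math

# Illustrative "performance region" from hardware constants alone: for a given
# GPU and tile shape, the largest per-expert token count whose CTA grid still
# fits within one wave of SMs. The constants, tile sizes, and conclusion drawn
# are assumptions for illustration, not the paper's actual analysis.

SM_COUNTS = {"A100": 108, "H100": 132, "RTX 4090": 128}


def single_wave_boundary(num_sms, tile_m, tile_n, n_cols):
    """Tokens per expert at which ceil(M/tile_m) * ceil(N/tile_n) CTAs reach one wave."""
    ctas_along_n = math.ceil(n_cols / tile_n)      # CTAs needed to cover the output columns
    row_blocks_per_wave = num_sms // ctas_along_n  # row blocks that fit in one wave
    return row_blocks_per_wave * tile_m


if __name__ == "__main__":
    for gpu, sms in SM_COUNTS.items():
        b = single_wave_boundary(sms, tile_m=64, tile_n=128, n_cols=4096)
        print(f"{gpu}: grid stays within one wave up to ~{b} tokens per expert")
```

Because boundaries of this kind depend only on SM count and grid geometry, they can be written down for hardware that was never profiled, which is consistent with the abstract's claim that the analysis correctly predicts behavior on three unseen architectures.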