RaMP: Runtime-Aware Megakernel Polymorphism for Mixture-of-Experts
arXiv cs.AI / April 30, 2026
💬 Opinion · Developer Stack & Infrastructure · Models & Research
Key Points
- The paper argues that Mixture-of-Experts (MoE) inference performance depends not only on batch size but also on the expert routing distribution, and that existing production dispatch policies miss 10–70% of potential kernel throughput.
- It proposes RaMP, a routing-aware dispatch framework that uses a performance-region analysis (based on hardware constants) to determine when different optimizations matter across multiple architectures.
- RaMP includes a four-parameter “wave cost” model that selects near-optimal kernel configurations from a runtime expert histogram, achieving 0.93% mean regret versus exhaustive search with only 10–24 minutes of one-time profiling.
- The approach is kernel-agnostic: driven only by CTA grid geometry, it can be applied to Alpha-MoE without source changes and also benefits from a co-designed CuTe DSL kernel with 134–268 polymorphic configurations.
- Reported speedups include 1.22× kernel and 1.30× end-to-end improvements in vLLM serving over Triton, plus additional gains across DeepGEMM and FlashInfer CUTLASS backends.
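The paper's actual four-parameter "wave cost" model is not reproduced in this summary, so the sketch below only illustrates the general idea behind such a model: estimate how many full "waves" of CTAs a candidate kernel configuration launches for a given runtime expert histogram, then pick the cheapest configuration. All names, parameters, and constants here (`alpha`, `beta`, the tile shapes) are hypothetical placeholders, not the paper's values.

```python
import math

def wave_cost(histogram, tile_m, tile_n, n_cols, num_sms,
              alpha=1.0, beta=0.1):
    """Illustrative wave-quantization cost for one kernel config.

    histogram: tokens routed to each expert at runtime.
    tile_m, tile_n: CTA tile shape of the candidate config.
    n_cols: output columns per expert GEMM (model dimension).
    num_sms: number of SMs on the target GPU.
    alpha, beta: hypothetical per-wave and fixed-overhead constants
    (the paper fits four such constants via one-time profiling).
    """
    # Each non-empty expert contributes a grid of CTAs; partially
    # filled tiles still cost a full CTA (the quantization effect).
    ctas = sum(math.ceil(t / tile_m) * math.ceil(n_cols / tile_n)
               for t in histogram if t > 0)
    # CTAs execute in waves of num_sms; a partial wave costs a full one.
    waves = math.ceil(ctas / num_sms)
    return alpha * waves + beta

def pick_config(histogram, configs, n_cols, num_sms):
    """Choose the config with the lowest estimated wave cost."""
    return min(configs,
               key=lambda c: wave_cost(histogram, c[0], c[1],
                                       n_cols, num_sms))

# Skewed routing: one hot expert, two cold ones.
hist = [1000, 10, 10]
best = pick_config(hist, [(128, 128), (64, 64)],
                   n_cols=4096, num_sms=132)
```

With the skewed histogram above, the larger tile wins because the small tile shatters the hot expert's GEMM into far more CTAs than the SMs can absorb in few waves; a more uniform histogram can flip that choice, which is the routing-awareness the paper argues existing dispatch policies lack.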