RaMP: Runtime-Aware Megakernel Polymorphism for Mixture-of-Experts

arXiv cs.AI / 4/30/2026

💬 Opinion · Developer Stack & Infrastructure · Models & Research

Key Points

  • The paper argues that Mixture-of-Experts (MoE) inference performance depends not only on batch size but also on the expert routing distribution, and that existing production dispatch policies miss 10–70% of potential kernel throughput.
  • It proposes RaMP, a routing-aware dispatch framework that uses a performance-region analysis (based on hardware constants) to determine when different optimizations matter across multiple architectures.
  • RaMP includes a four-parameter “wave cost” model (sketched after this list) that selects near-optimal kernel configurations from the runtime expert histogram, achieving 0.93% mean regret versus exhaustive search after only 10–24 minutes of one-time profiling per model.
  • The approach is kernel-agnostic: driven only by CTA grid geometry, it can be applied to Alpha-MoE without source changes and also benefits from a co-designed CuTe DSL kernel with 134–268 polymorphic configurations.
  • Reported speedups include a 1.22× kernel speedup over static dispatch and a 1.30× end-to-end improvement in vLLM serving over Triton, plus further end-to-end gains over the DeepGEMM (1.41×) and FlashInfer CUTLASS (1.13×) backends.
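The summary does not include the paper's cost-model code, but the wave-based idea from the third key point can be sketched. In the sketch below, the parameter names (alpha, beta, gamma, delta), the candidate-configuration format, the SM count, and the helper names are illustrative assumptions rather than RaMP's actual implementation; the only ingredients taken from the source are that cost is computed from CTA grid geometry and the runtime expert histogram, and that four fitted parameters drive the selection.

```python
import math

# Schematic wave-cost selector. All names and constants below are assumptions
# for illustration; only the idea (cost from CTA grid geometry plus the runtime
# expert histogram, scored by four fitted parameters) comes from the paper.

NUM_SMS = 132  # CTAs that run concurrently in one "wave"; query the device in practice


def cta_count(tokens, tile_m, tile_n, n_cols):
    """CTAs one expert's GEMM launches under a (tile_m, tile_n) configuration."""
    return math.ceil(tokens / tile_m) * math.ceil(n_cols / tile_n)


def wave_cost(histogram, config, params, n_cols):
    """Hypothetical four-parameter cost: full waves, a partial wave, total CTAs, a constant."""
    alpha, beta, gamma, delta = params  # assumed to be fitted during one-time profiling
    tile_m, tile_n = config
    total_ctas = sum(cta_count(t, tile_m, tile_n, n_cols) for t in histogram if t > 0)
    full_waves, remainder = divmod(total_ctas, NUM_SMS)
    return alpha * full_waves + beta * (1 if remainder else 0) + gamma * total_ctas + delta


def select_config(histogram, candidate_configs, params, n_cols):
    """Pick the cheapest candidate configuration for the observed routing."""
    return min(candidate_configs, key=lambda cfg: wave_cost(histogram, cfg, params, n_cols))
```

A dispatcher in this style would call select_config once per MoE layer with the histogram produced by the router. Because the cost is a closed-form function of grid geometry, the same selector can sit in front of any kernel that exposes multiple configurations, which is what makes the approach kernel-agnostic.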

Abstract

The optimal kernel configuration for Mixture-of-Experts (MoE) inference depends on both batch size and the expert routing distribution, yet production systems dispatch from batch size alone, leaving 10–70% of kernel throughput unrealized. We present RaMP, a routing-aware dispatch framework. A performance-region analysis derives, from hardware constants alone, when each optimization helps, correctly predicting behavior on all 8 tested architectures, including 3 unseen ones. A four-parameter wave cost model selects the fastest configuration from the runtime expert histogram, achieving 0.93% mean regret versus exhaustive search while being fitted from just 10–24 minutes of one-time profiling per model. Because the model depends only on CTA grid geometry, it is kernel-agnostic: applied to Alpha-MoE, it delivers a 1.14× speedup with no source modification. Paired with a co-designed CuTe DSL kernel exposing 134–268 polymorphic configurations, RaMP delivers a 1.22× kernel speedup over static dispatch and a 1.30× end-to-end speedup in vLLM serving over Triton, 1.41× over DeepGEMM, and 1.13× over FlashInfer CUTLASS.
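The abstract's performance-region analysis is derived from hardware constants alone, but the specific regions and optimizations are not spelled out in this summary. Purely as an illustration of that style of reasoning, with assumed tile sizes, SM counts, and a hypothetical question, one such region could be the per-expert token count below which an expert's GEMM no longer fills a single wave of SMs, where occupancy-oriented configurations tend to matter most:

```python
import math

# Illustrative "performance region" from hardware constants alone: for a given
# GPU and tile shape, the largest per-expert token count whose CTA grid still
# fits within one wave of SMs. The constants, tile sizes, and conclusion drawn
# are assumptions for illustration, not the paper's actual analysis.

SM_COUNTS = {"A100": 108, "H100": 132, "RTX 4090": 128}


def single_wave_boundary(num_sms, tile_m, tile_n, n_cols):
    """Tokens per expert at which ceil(M/tile_m) * ceil(N/tile_n) CTAs reach one wave."""
    ctas_along_n = math.ceil(n_cols / tile_n)      # CTAs needed to cover the output columns
    row_blocks_per_wave = num_sms // ctas_along_n  # row blocks that fit in one wave
    return row_blocks_per_wave * tile_m


if __name__ == "__main__":
    for gpu, sms in SM_COUNTS.items():
        b = single_wave_boundary(sms, tile_m=64, tile_n=128, n_cols=4096)
        print(f"{gpu}: grid stays within one wave up to ~{b} tokens per expert")
```

Because boundaries of this kind depend only on SM count and grid geometry, they can be written down for hardware that was never profiled, which is consistent with the abstract's claim that the analysis correctly predicts behavior on three unseen architectures.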