AdaFuse: Accelerating Dynamic Adapter Inference via Token-Level Pre-Gating and Fused Kernel Optimization
arXiv cs.AI / 3/13/2026
Key Points
- AdaFuse targets the latency bottleneck of dynamic adapters by showing that the overhead comes from fragmented CUDA kernel launches rather than the core computations.
- It introduces a token-level pre-gating strategy that makes a single global routing decision for all adapter layers, effectively fixing the execution path per token.
- This enables a fused CUDA kernel that merges all selected LoRA adapters into the backbone model in one efficient pass.
- Experimental results on popular open-source LLMs show accuracy comparable to state-of-the-art dynamic adapters while reducing decoding latency by more than 2.4×.
- The work demonstrates a hardware–software co-design approach to improve inference efficiency without sacrificing model capability.
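The core idea behind the speedup can be illustrated in a few lines. Below is a minimal numpy sketch (not the paper's CUDA implementation; all names and shapes are illustrative assumptions): a router makes one global adapter decision per token, which fixes that token's execution path and lets the selected LoRA adapters be applied to the whole batch in a single fused pass instead of one small kernel launch per token.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, num_adapters, num_tokens = 16, 4, 3, 5

W = rng.standard_normal((d, d)) * 0.1                 # frozen backbone weight
A = rng.standard_normal((num_adapters, r, d)) * 0.1   # LoRA down-projections
B = rng.standard_normal((num_adapters, d, r)) * 0.1   # LoRA up-projections
x = rng.standard_normal((num_tokens, d))              # token hidden states

# Token-level pre-gating: one routing decision per token,
# fixed for all adapter layers (hypothetical linear router).
router = rng.standard_normal((d, num_adapters))
adapter_ids = np.argmax(x @ router, axis=-1)          # shape: (num_tokens,)

# Baseline dynamic gating: a separate adapter computation per token,
# which on a GPU would mean many small fragmented kernel launches.
y_baseline = np.empty_like(x)
for t in range(num_tokens):
    k = adapter_ids[t]
    y_baseline[t] = x[t] @ W.T + (x[t] @ A[k].T) @ B[k].T

# Fused view: since each token's path is fixed up front, gather the
# selected adapter weights and compute everything in one batched pass.
Ak = A[adapter_ids]                                   # (num_tokens, r, d)
Bk = B[adapter_ids]                                   # (num_tokens, d, r)
low = np.einsum("td,trd->tr", x, Ak)                  # batched down-projection
y_fused = x @ W.T + np.einsum("tr,tdr->td", low, Bk)  # batched up-projection

assert np.allclose(y_baseline, y_fused)
```

Both paths produce identical outputs; the fused version simply restructures the work so that, in a real CUDA implementation, it can run as one kernel rather than a launch per token-adapter pair.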
