AdaFuse: Accelerating Dynamic Adapter Inference via Token-Level Pre-Gating and Fused Kernel Optimization
arXiv cs.AI / 3/13/2026
Key Points
- AdaFuse targets the latency bottleneck of dynamic adapters by showing that the overhead comes from fragmented CUDA kernel launches rather than the core computations.
- It introduces a token-level pre-gating strategy that makes a single global routing decision for all adapter layers, effectively fixing the execution path per token.
- This enables a fused CUDA kernel that merges all selected LoRA adapters into the backbone model in one efficient pass.
- Experimental results on popular open-source LLMs show accuracy comparable to state-of-the-art dynamic adapters while reducing decoding latency by more than 2.4x.
- The work demonstrates a hardware–software co-design approach to improve inference efficiency without sacrificing model capability.
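The core idea behind the two mechanisms above can be illustrated in a few lines: pre-gating makes one routing decision per token up front, and the fused pass folds every selected low-rank (LoRA) update into a single effective weight so the token is processed in one matmul rather than many small launches. This is a minimal NumPy sketch under assumed shapes, not the paper's CUDA kernel; the names `pre_gate` and `fused_forward` are illustrative, and a random score vector stands in for the learned gate.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n_adapters = 8, 2, 4

W = rng.standard_normal((d, d))               # frozen backbone weight (assumed single layer)
A = rng.standard_normal((n_adapters, r, d))   # hypothetical LoRA down-projections
B = rng.standard_normal((n_adapters, d, r))   # hypothetical LoRA up-projections

def pre_gate(token, k=2):
    # Token-level pre-gating: one global routing decision per token,
    # fixing the set of adapters for the whole forward pass instead of
    # re-routing (and re-launching kernels) at every adapter layer.
    scores = rng.standard_normal(n_adapters)  # stand-in for a learned gate
    return np.argsort(scores)[-k:]

def fused_forward(token, selected):
    # "Fused" pass: merge all selected rank-r updates into a single
    # effective weight, so the token needs one matmul -- mimicking a
    # single fused kernel launch instead of one launch per adapter.
    W_eff = W + sum(B[i] @ A[i] for i in selected)
    return W_eff @ token

token = rng.standard_normal(d)
sel = pre_gate(token)
y = fused_forward(token, sel)
assert y.shape == (d,)
```

The fused result is mathematically identical to applying the backbone and each selected adapter separately and summing; the gain in the real system comes purely from collapsing many small kernel launches into one.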