ReMix: Reinforcement routing for mixtures of LoRAs in LLM finetuning
arXiv cs.LG / 3/12/2026
📰 News · Models & Research
Key Points
- Mixture-of-LoRAs can suffer from imbalanced routing weights, causing only a few LoRAs to dominate and limiting expressivity.
- ReMix introduces non-learnable routing weights to keep all active LoRAs effective, preventing domination by a single LoRA.
- To train with non-learnable weights, ReMix uses an unbiased gradient estimator based on REINFORCE leave-one-out (RLOO), treating the supervision loss as the reward.
- Extensive experiments show ReMix significantly outperforms state-of-the-art parameter-efficient fine-tuning methods with a comparable number of activated parameters.
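The REINFORCE leave-one-out estimator referenced above can be sketched in a few lines. This is a minimal illustration of the general RLOO advantage computation, not ReMix's actual implementation: it assumes K routings are sampled independently, each is scored by its supervision loss (negative loss as reward), and each sample's score-function gradient is weighted by its reward minus the mean reward of the other K−1 samples.

```python
def rloo_advantages(losses):
    """Leave-one-out advantages for a REINFORCE-style routing update.

    losses: supervision losses for K independently sampled routings
            (lower is better); reward_i = -loss_i.
    Returns A_i = r_i - mean_{j != i} r_j. Because each baseline
    excludes its own sample, the estimator stays unbiased while the
    baseline reduces variance.
    """
    rewards = [-l for l in losses]
    k = len(rewards)
    total = sum(rewards)
    # Baseline for sample i: average reward of the other k-1 samples.
    return [r - (total - r) / (k - 1) for r in rewards]
```

In a training loop, the policy-gradient term for routing sample i would weight the gradient of its routing log-probability by `A_i`; note the advantages always sum to zero, so equally good samples yield no update.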