Where Should LoRA Go? Component-Type Placement in Hybrid Language Models

arXiv cs.LG · April 27, 2026

💬 Opinion · Models & Research

Key Points

  • The study argues that the standard LoRA practice of applying adapters uniformly is suboptimal for hybrid language models, because different component types (attention vs. recurrent/SSM) play distinct functional roles.
  • Experiments on Qwen3.5-0.8B and Falcon-H1-0.5B show that placing LoRA only on the attention pathway, even though it is the minority component, consistently outperforms full-model adaptation while using 5–10× fewer trainable parameters.
  • Adapting the recurrent backbone has architecture-dependent effects: it is destructive in sequential hybrids (e.g., −14.8 pp on GSM8K) but constructive in parallel hybrids (+8.6 pp).
  • The authors also find a transfer asymmetry: parallel hybrids benefit from positive cross-task transfer, while sequential hybrids experience catastrophic forgetting.
  • Overall, the paper concludes that hybrid topology fundamentally shapes how a model responds to adaptation, making component-aware LoRA placement an essential design consideration for hybrid architectures (see the placement sketch after this list).
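
The component-aware placement described above maps directly onto existing adapter tooling. Below is a minimal sketch, assuming the Hugging Face PEFT library and assuming the attention pathway is exposed under the common projection names (q_proj, k_proj, v_proj, o_proj); the checkpoint id and module names are illustrative rather than taken from the paper, so verify them against the model's named_modules() before use.

```python
# Minimal sketch: attention-only LoRA placement with Hugging Face PEFT.
# Assumption: the hybrid checkpoint exposes its attention projections under
# the common names below; inspect model.named_modules() to confirm.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("tiiuae/Falcon-H1-0.5B-Base")  # illustrative checkpoint id

# Restrict adapters to the attention pathway; the recurrent/SSM backbone
# stays frozen and unadapted.
attention_only = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(model, attention_only)
peft_model.print_trainable_parameters()  # compare against a full-model adapter configuration
```

For the full-model baseline, target_modules would instead cover every linear projection, including those in the recurrent/SSM blocks; the 5–10× gap in trainable parameters reported by the paper comes from leaving those components unadapted.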

Abstract

Hybrid language models that interleave attention with recurrent components are increasingly competitive with pure Transformers, yet standard LoRA practice applies adapters uniformly without considering the distinct functional roles of each component type. We systematically study component-type LoRA placement across two hybrid architectures -- Qwen3.5-0.8B (sequential, GatedDeltaNet + softmax attention) and Falcon-H1-0.5B (parallel, Mamba-2 SSM + attention) -- fine-tuned on three domains and evaluated on five benchmarks. We find that the attention pathway -- despite being the minority component -- consistently outperforms full-model adaptation with 5-10x fewer trainable parameters. Crucially, adapting the recurrent backbone is destructive in sequential hybrids (-14.8 pp on GSM8K) but constructive in parallel ones (+8.6 pp). We further document a transfer asymmetry: parallel hybrids exhibit positive cross-task transfer while sequential hybrids suffer catastrophic forgetting. These results establish that hybrid topology fundamentally determines adaptation response, and that component-aware LoRA placement is a necessary design dimension for hybrid architectures.
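
One way to sanity-check the trainable-parameter claim on a concrete checkpoint is to group trainable parameters by component type after applying LoRA. The sketch below is a rough audit, assuming component types can be distinguished by substrings in parameter names ("attn", "mamba", "ssm", "delta"); those patterns are guesses about naming conventions, not something specified in the paper, and should be adjusted to the model being inspected.

```python
# Rough audit: how many trainable parameters land on each component type after
# applying LoRA. The name substrings used for classification are assumptions.
from collections import Counter

def count_trainable_by_component(model):
    counts = Counter()
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        lowered = name.lower()
        if "attn" in lowered or "attention" in lowered:
            counts["attention"] += param.numel()
        elif any(tag in lowered for tag in ("mamba", "ssm", "delta")):
            counts["recurrent"] += param.numel()
        else:
            counts["other"] += param.numel()
    return counts

# Usage with the PEFT-wrapped model from the earlier sketch:
# print(count_trainable_by_component(peft_model))
```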