Sub-Token Routing in LoRA for Adaptation and Query-Aware KV Compression

arXiv cs.LG / 4/24/2026


Key Points

  • The paper proposes sub-token routing within LoRA-adapted transformers as a finer-grained efficiency control than earlier coarse routing units like tokens, heads, or layers.
  • It argues that, under KV retention budgets, important information is unevenly distributed both across tokens and inside tokens, so KV compression should not be treated as an all-or-nothing per-token choice.
  • For language modeling, the authors introduce a query-independent method combining routed subspace LoRA with value-group routing on the KV path to improve the quality–compression tradeoff.
  • For downstream tasks, they present a query-aware approach that uses a predictor-based selector to allocate a global retention budget across context token/value-group pairs conditioned on query relevance.
  • Experiments indicate that query-independent routing benefits language modeling, while query-aware routing better preserves downstream behavior at reduced KV budgets, and the study shows token-level and sub-token-level routing act as complementary compression axes.
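To make the sub-token granularity concrete, here is a minimal sketch of query-independent value-group routing on the KV path. Everything in it is an assumption for illustration: the function name `route_value_groups`, the contiguous group split, and the use of per-group L2 norm as an importance score stand in for the paper's learned router; the actual routed subspace LoRA design is not reproduced here.

```python
import numpy as np

def route_value_groups(V, num_groups, keep_ratio):
    """Illustrative query-independent sub-token routing (not the paper's method).

    V: (num_tokens, d_model) array of cached value vectors.
    Splits each value vector into `num_groups` contiguous groups and,
    per token, keeps only the groups with the largest L2 norm
    (a stand-in for a learned router), zeroing the rest.
    Returns the compressed values and the boolean keep-mask.
    """
    T, d = V.shape
    assert d % num_groups == 0, "d_model must divide evenly into groups"
    g = d // num_groups
    groups = V.reshape(T, num_groups, g)          # (T, G, g)
    scores = np.linalg.norm(groups, axis=-1)      # (T, G) per-group importance
    keep = max(1, int(round(keep_ratio * num_groups)))
    top = np.argsort(-scores, axis=1)[:, :keep]   # top-`keep` groups per token
    mask = np.zeros((T, num_groups), dtype=bool)
    np.put_along_axis(mask, top, True, axis=1)
    compressed = groups * mask[:, :, None]        # zero out dropped groups
    return compressed.reshape(T, d), mask
```

Note that the retained groups can differ from token to token, which is exactly the "important information is unevenly distributed inside tokens" point the paper makes.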

Abstract

Sub-token routing offers a finer control axis for transformer efficiency than the coarse units used in most prior work, such as tokens, pages, heads, or layers. In this paper, we study routing within a token representation itself in LoRA-adapted transformers. The motivation is that a relevant token need not be internally uniform: under a retention budget, preserved value groups are distributed unevenly both across tokens and within tokens, which suggests that KV compression need not be an all-or-nothing decision at the token level. We study this fine-grained routing mechanism in two settings. For compression-aware language modeling, we introduce a query-independent design that combines routed subspace LoRA with value-group routing on the KV path. For downstream-task-preserving KV compression, we introduce a query-aware design in which a predictor-based selector allocates a global retention budget over context-token/value-group pairs using query-conditioned relevance. Experiments show that the query-independent design improves the quality–compression tradeoff for language modeling, while the query-aware design preserves downstream behavior under reduced KV budgets. We further examine the relation between token-level and sub-token-level query-aware routing, and show that they form complementary compression axes: token-level methods determine which tokens survive globally, while sub-token routing determines how the surviving tokens are compressed internally.
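The query-aware design's key distinction is that the retention budget is global rather than per-token. The sketch below illustrates only that allocation step, under stated assumptions: `relevance` is taken as a given (tokens × groups) matrix of query-conditioned scores, whereas in the paper such scores come from a learned predictor-based selector; the function name `allocate_global_budget` is hypothetical.

```python
import numpy as np

def allocate_global_budget(relevance, budget):
    """Illustrative query-aware budget allocation (not the paper's selector).

    relevance: (num_tokens, num_groups) query-conditioned scores,
               assumed here to be precomputed by some predictor.
    budget:    total number of (token, group) pairs to retain globally.
    Keeps the global top-`budget` pairs instead of a fixed count per
    token, so query-relevant tokens can retain more internal detail.
    """
    T, G = relevance.shape
    flat = relevance.ravel()
    top = np.argsort(-flat)[:budget]   # indices of the best pairs overall
    mask = np.zeros(T * G, dtype=bool)
    mask[top] = True
    return mask.reshape(T, G)          # True = retain this (token, group) pair
```

Contrast with per-token routing: a token-level method would decide whether each row of the mask survives as a whole, while this allocation can keep two groups of one token and zero groups of another, which is the complementarity between the two axes that the paper reports.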