POLAR: Online Learning for LoRA Adapter Caching and Routing in Edge LLM Serving

arXiv cs.LG / 4/21/2026


Key Points

  • Edge LLM serving often keeps only a small subset of LoRA adapters resident in GPU/DRAM, so routing requests to non-resident adapters requires weight paging that adds latency.
  • The paper frames caching (slow timescale) and routing (fast timescale) as a coupled two-timescale contextual bandit problem, where cache decisions affect exploration cost and routing choices determine which adapters provide feedback.
  • POLAR introduces an epoch-based cache controller paired with a cache-aware LinUCB router, and studies both a fixed-epoch variant (with worst-case regret guarantees) and POLAR+ (epoch-doubling with forced exploration).
  • POLAR+ achieves sublinear regret under stochastic regularity and cacheability assumptions, and experiments with 15 real LoRA adapters for Qwen2.5-7B confirm that adaptive cache control beats non-adaptive baselines while matching the theory’s scaling trends.
  • The results indicate that memory hierarchy constraints mainly impact cache optimization rather than fundamentally slowing down routing learning.
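The cache-aware routing idea in the points above can be sketched as standard LinUCB with a paging penalty applied to non-resident adapters. This is an illustrative reconstruction under assumed mechanics, not the paper's exact algorithm: the class name, the additive `page_cost` penalty, and all parameter values are hypothetical.

```python
import numpy as np

class CacheAwareLinUCB:
    """Illustrative sketch of a cache-aware LinUCB router (assumed form,
    not POLAR's exact update rules).

    Each adapter keeps standard LinUCB ridge-regression state (A_i, b_i).
    Non-resident adapters are charged an assumed paging penalty when scored,
    so exploration prefers cached adapters unless the UCB gap justifies
    paging weights in from storage.
    """

    def __init__(self, n_adapters, dim, alpha=1.0, page_cost=0.2):
        self.alpha = alpha          # exploration-bonus width
        self.page_cost = page_cost  # assumed latency charge for a cache miss
        self.A = [np.eye(dim) for _ in range(n_adapters)]
        self.b = [np.zeros(dim) for _ in range(n_adapters)]

    def route(self, x, cache):
        """Pick the adapter maximizing UCB score minus the paging penalty."""
        scores = []
        for i in range(len(self.A)):
            A_inv = np.linalg.inv(self.A[i])
            theta = A_inv @ self.b[i]            # ridge estimate of utility
            ucb = theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)
            if i not in cache:
                ucb -= self.page_cost            # penalize non-resident adapters
            scores.append(ucb)
        return int(np.argmax(scores))

    def update(self, i, x, reward):
        """Rank-one update of the chosen adapter's regression state."""
        self.A[i] += np.outer(x, x)
        self.b[i] += reward * x
```

This makes the coupling concrete: the cache set changes which arm wins the argmax, so cache decisions directly shape which adapters receive feedback.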

Abstract

Edge deployment of large language models (LLMs) increasingly relies on libraries of lightweight LoRA adapters, yet GPU/DRAM can keep only a small resident subset at a time. Serving a request through a non-resident adapter requires paging its weights from storage, incurring measurable latency. This creates a two-timescale online control problem: on a slow timescale, the system selects which adapters remain resident in fast memory, while on a fast timescale it routes each request to an adapter whose context-dependent utility is unknown a priori. The two decisions are tightly coupled: the cache determines the cost of exploration, and the router determines which adapters receive informative feedback. We formulate this joint caching-and-routing problem as a two-timescale contextual bandit and propose POLAR (Paging and Online Learning for Adapter Routing). POLAR pairs a cache-aware LinUCB router with an epoch-based cache controller. We study two variants. A fixed-epoch version provides a robust baseline with worst-case regret guarantees under arbitrary contexts. An epoch-doubling version, POLAR+, adds forced exploration and improved cache optimization to achieve $\widetilde{\mathcal{O}}(d\sqrt{NT}+\sqrt{KT})$ sublinear regret under stochastic regularity and cacheability conditions, where $N$ is the adapter count, $K$ the cache size, $d$ the context dimension, and $T$ the horizon. The routing term matches the standard contextual-bandit rate up to logarithmic factors, showing that the memory hierarchy does not fundamentally slow routing learning. Experiments using 15 real LoRA adapters for Qwen2.5-7B together with measured GPU paging latencies show that adaptive cache control substantially outperforms non-adaptive baselines and exhibits scaling trends consistent with the theory.
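The slow-timescale side of the abstract can be illustrated with a minimal epoch-doubling cache controller. This is a sketch under assumed mechanics, not the paper's POLAR+ procedure: the round-robin forced-exploration rule, the usage-count cache criterion, and all names are hypothetical stand-ins for the paper's optimization step.

```python
from collections import Counter

def epoch_doubling_cache(requests, n_adapters, cache_size, explore_slots=1):
    """Illustrative epoch-doubling cache controller (assumed mechanics).

    Epoch m lasts 2**m requests. At each epoch boundary the cache keeps the
    top-(K - explore_slots) adapters by observed usage, then fills the
    remaining slots with forced-exploration picks chosen round-robin, so
    every adapter eventually becomes resident and can be evaluated cheaply.
    Returns the sequence of cache configurations, one per epoch.
    """
    cache = set(range(cache_size))           # arbitrary initial residents
    counts = Counter()
    explore_ptr = 0
    epoch_len, seen = 1, 0
    history = [frozenset(cache)]
    for adapter in requests:
        counts[adapter] += 1
        seen += 1
        if seen == epoch_len:                # epoch boundary: recompute cache
            top = [a for a, _ in counts.most_common(cache_size - explore_slots)]
            while len(top) < cache_size:     # forced exploration slots
                cand = explore_ptr % n_adapters
                explore_ptr += 1
                if cand not in top:
                    top.append(cand)
            cache = set(top)
            history.append(frozenset(cache))
            epoch_len *= 2                   # doubling schedule
            seen = 0
    return history
```

The doubling schedule means only $\mathcal{O}(\log T)$ cache reconfigurations occur over a horizon of $T$ requests, which is what keeps the slow-timescale switching cost from dominating the regret.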
