Parameter Efficiency Is Not Memory Efficiency: Rethinking Fine-Tuning for On-Device LLM Adaptation

arXiv cs.LG / 4/28/2026

📰 News · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • The paper argues that parameter-efficient fine-tuning (e.g., LoRA, IA3) does not automatically translate into memory efficiency for on-device LLM adaptation.
  • It shows that even with fewer trainable parameters, PEFT methods still require intermediate tensors that grow linearly with sequence length, leading to out-of-memory issues on device (see the rough arithmetic after this list).
  • The authors propose LARS (Low-memory Activation-Rank Subspace), which constrains the activation subspace during training to decouple memory usage from sequence length.
  • Experiments report average memory reductions of 33.54% on GPUs and 51.95% on CPUs versus LoRA, while maintaining competitive accuracy and throughput across multiple datasets and model types.
  • The framework is also demonstrated on Raspberry Pi and consumer-grade CPUs, indicating a practical route to personalized LLM adaptation on resource-limited edge hardware.
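To make the scaling argument concrete, the back-of-envelope calculation below compares the memory held by LoRA's trainable parameters with the input activation a single linear layer must cache for its backward pass. The numbers are not from the paper; the hidden size, LoRA rank, fp16 storage, and batch size of 1 are illustrative assumptions.

```python
# Back-of-envelope arithmetic (illustrative, not from the paper): memory held by
# LoRA's trainable parameters vs. the input activation a single linear layer must
# cache for its backward pass. The activation term grows linearly with sequence
# length even though the trainable-parameter term stays constant.

BYTES_FP16 = 2
hidden, rank = 4096, 16                      # hypothetical model width and LoRA rank

lora_params = 2 * hidden * rank              # A (hidden x r) plus B (r x hidden)
param_mem_mb = lora_params * BYTES_FP16 / 2**20

for seq_len in (512, 2048, 8192, 32768):
    # Cached input for one layer: seq_len x hidden values, batch size 1.
    act_mem_mb = seq_len * hidden * BYTES_FP16 / 2**20
    print(f"seq_len={seq_len:>6}: LoRA params {param_mem_mb:6.2f} MB, "
          f"cached activation {act_mem_mb:8.2f} MB")
```

Under these assumptions the LoRA parameters occupy a constant 0.25 MB per layer, while the cached activation for that same layer grows from a few MB at short sequences to hundreds of MB at 32k tokens, which is the gap the paper targets.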

Abstract

Parameter-Efficient Fine-Tuning (PEFT) has become the standard approach for adapting large language models (LLMs). In this work we challenge the widespread assumption that parameter efficiency equates to memory efficiency and on-device adaptability. We show that this is not the case: while methods like LoRA and IA3 significantly reduce the number of trainable parameters, they remain bound by intermediate tensors that scale linearly with sequence length, often triggering out-of-memory errors on device. To address this, we introduce LARS (Low-memory Activation-Rank Subspace), a novel adaptation framework that decouples memory consumption from sequence length. Whereas prior PEFT methods apply low-rank constraints to model parameters, LARS instead constrains the activation subspace used during training, directly targeting the dominant source of memory consumption and fundamentally flattening the memory growth rate. LARS reduces the memory footprint by an average of 33.54% on GPUs and 51.95% on CPUs compared with LoRA across reasoning, understanding, and long-context datasets using different models, while maintaining competitive accuracy and throughput. Beyond GPUs, we also deploy LARS on a Raspberry Pi and consumer-grade CPUs to demonstrate that it provides a scalable path to sophisticated LLM personalization on resource-constrained hardware and edge devices.
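The abstract does not spell out the mechanism, so the PyTorch snippet below is only a minimal sketch of the general idea of constraining the activation subspace: the layer caches a low-rank projection of its input for the backward pass instead of the full activation, shrinking the saved tensor from seq × hidden to seq × k. The class name, the fixed random basis, and the layer shapes are illustrative assumptions; how LARS actually constructs the subspace and flattens memory growth with sequence length is specific to the paper.

```python
import torch

class LowRankActivationLinear(torch.autograd.Function):
    """Illustrative sketch only: a linear layer that caches a rank-k projection
    of its input instead of the full activation, so the saved tensor has shape
    (batch, seq, k) rather than (batch, seq, hidden). This is a generic
    activation-subspace constraint, not the LARS algorithm itself."""

    @staticmethod
    def forward(ctx, x, weight, basis):
        # x: (batch, seq, hidden), weight: (out, hidden), basis: (hidden, k), orthonormal columns
        coeffs = x @ basis                      # (batch, seq, k) -- the only activation we cache
        ctx.save_for_backward(coeffs, weight, basis)
        return x @ weight.t()

    @staticmethod
    def backward(ctx, grad_out):
        coeffs, weight, basis = ctx.saved_tensors
        x_approx = coeffs @ basis.t()           # low-rank reconstruction of the cached input
        grad_x = grad_out @ weight              # exact gradient w.r.t. the layer input
        # Weight gradient is computed from the reconstructed (approximate) activations.
        grad_w = grad_out.flatten(0, 1).t() @ x_approx.flatten(0, 1)
        return grad_x, grad_w, None             # no gradient for the fixed basis

# Hypothetical usage: a fixed random orthonormal basis defines a rank-32 subspace.
x = torch.randn(1, 1024, 4096, requires_grad=True)
w = torch.randn(2048, 4096, requires_grad=True)
basis, _ = torch.linalg.qr(torch.randn(4096, 32))
out = LowRankActivationLinear.apply(x, w, basis)
out.sum().backward()
```

In this sketch the rank k trades memory against the fidelity of the weight gradient; the paper's contribution lies in choosing and constraining that subspace so accuracy and throughput remain competitive with LoRA.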