Low-Rank Adaptation for Critic Learning in Off-Policy Reinforcement Learning
arXiv cs.LG · April 22, 2026
Key Points
- The paper addresses overfitting and instability in off-policy reinforcement learning caused by scaling critic networks, especially when using replay-buffer-based bootstrap training.
- It proposes using Low-Rank Adaptation (LoRA) for off-policy critics by freezing randomly initialized base weights and training only low-rank adapters, effectively restricting updates to a low-dimensional subspace.
- The method builds on SimbaV2 and introduces a LoRA formulation compatible with SimbaV2 that preserves its hyperspherical normalization geometry during frozen-backbone training.
- Experiments on DeepMind Control and IsaacLab robotics benchmarks with SAC and FastTD3 show that the LoRA-based critics achieve lower critic loss and stronger policy performance than the alternatives tested.
- Overall, the authors argue that adaptive low-rank updates provide a simple, scalable structural regularization technique for critic learning in off-policy RL.
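The core mechanism from the second point above — a frozen, randomly initialized base weight with only low-rank adapters trained, so every update lies in a low-dimensional subspace — can be sketched in a few lines of numpy. This is a minimal illustration under assumed shapes and a hypothetical `rank` setting, not the paper's implementation; the SimbaV2-specific hyperspherical normalization is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 8, 8, 2  # illustrative dimensions, not from the paper

# Frozen, randomly initialized base weight: never updated during training.
W = rng.standard_normal((d_out, d_in))

# Trainable low-rank adapters. B starts at zero (standard LoRA init),
# so the layer's initial output equals the frozen base layer's output.
A = rng.standard_normal((rank, d_in)) * 0.01
B = np.zeros((d_out, rank))

def forward(x):
    # Effective weight is W + B @ A; only the rank-limited term B @ A changes.
    return (W + B @ A) @ x

x = rng.standard_normal(d_in)
assert np.allclose(forward(x), W @ x)  # identical to the base layer at init

# A stand-in for one optimizer step: only A and B receive updates.
B += 0.1 * rng.standard_normal((d_out, rank))

# The cumulative change to the effective weight has rank at most `rank`,
# which is the structural restriction the key points describe.
delta = B @ A
assert np.linalg.matrix_rank(delta) <= rank
```

In an actual off-policy critic, `forward` would sit inside each linear layer of the Q-network, with gradients from the bootstrap TD loss flowing only into `A` and `B` while `W` stays frozen.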
