Soft $Q(\lambda)$: A multi-step off-policy method for entropy-regularised reinforcement learning using eligibility traces

arXiv cs.LG / 4/16/2026


Key Points

  • The paper proposes a formal $n$-step extension of soft Q-learning for entropy-regularised reinforcement learning, where prior multi-step variants were restricted to on-policy Boltzmann action sampling (see the sketch after this list).
  • It introduces a new Soft Tree Backup operator to enable a fully off-policy multi-step setting without relying on on-policy Boltzmann sampling.
  • The authors combine these ideas into “Soft $Q(\lambda)$,” an online, off-policy eligibility-trace framework designed for efficient credit assignment under arbitrary behaviour policies.
  • The work presents derivations for a model-free approach to learning entropy-regularised value functions, intended to support future empirical work.
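
For orientation, the sketch below writes out one standard form of the entropy-regularised (soft) backup that these ideas build on. The temperature $\tau$, the reference policy $\bar{\pi}$, and in particular the per-step correction terms in the $n$-step target are assumptions in the style of the soft Q-learning and path-consistency literature, not the paper's own derivation.

```latex
% Soft (KL-regularised) state value: the log-partition of Q under a
% reference policy \bar{\pi}; a uniform \bar{\pi} recovers max-entropy RL.
\[
  V_{\mathrm{soft}}(s) = \tau \log \sum_{a} \bar{\pi}(a \mid s)\,
    \exp\!\bigl( Q(s, a) / \tau \bigr)
\]
% One-step soft Q-learning backup:
\[
  Q(s_t, a_t) \leftarrow r_t + \gamma\, V_{\mathrm{soft}}(s_{t+1})
\]
% A plausible n-step target under on-policy Boltzmann sampling, with
% per-step divergence corrections on the sampled actions (k >= 1):
\[
  G_t^{(n)} = r_t + \sum_{k=1}^{n-1} \gamma^{k}
    \Bigl( r_{t+k} - \tau \log
      \tfrac{\pi(a_{t+k} \mid s_{t+k})}{\bar{\pi}(a_{t+k} \mid s_{t+k})}
    \Bigr)
    + \gamma^{n}\, V_{\mathrm{soft}}(s_{t+n})
\]
```

Note that the expected $Q$ under the Boltzmann policy, minus its KL penalty against $\bar{\pi}$, equals $V_{\mathrm{soft}}$ exactly; replacing sampled continuations with that full expectation is what removes the on-policy sampling requirement, which is the role the Soft Tree Backup operator plays above.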

Abstract

Soft Q-learning has emerged as a versatile model-free method for entropy-regularised reinforcement learning, optimising returns augmented with a penalty on the divergence from a reference policy. Despite its success, multi-step extensions of soft Q-learning remain relatively unexplored and are limited to on-policy action sampling under the Boltzmann policy. In this brief research note, we first present a formal $n$-step formulation of soft Q-learning and then extend this framework to the fully off-policy case by introducing a novel Soft Tree Backup operator. Finally, we unify these developments into Soft $Q(\lambda)$, an elegant online, off-policy eligibility-trace framework that allows for efficient credit assignment under arbitrary behaviour policies. Our derivations yield a model-free method for learning entropy-regularised value functions that can be used in future empirical experiments.
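
To make the eligibility-trace mechanics concrete, here is a minimal tabular sketch in the spirit of Soft $Q(\lambda)$, assuming a Tree Backup-style trace cut by the Boltzmann target policy's probability of the action actually taken. Everything here, including the environment interface (`reset`, `step`) and the helper names `soft_value` and `boltzmann`, is a hypothetical illustration built from standard definitions, not the paper's algorithm.

```python
import numpy as np

def soft_value(q_row, tau):
    # Soft state value V(s) = tau * log sum_a exp(Q(s,a)/tau),
    # computed stably by shifting by the max (uniform reference policy).
    m = np.max(q_row)
    return m + tau * np.log(np.sum(np.exp((q_row - m) / tau)))

def boltzmann(q_row, tau):
    # Boltzmann target policy: pi(a|s) proportional to exp(Q(s,a)/tau).
    z = np.exp((q_row - np.max(q_row)) / tau)
    return z / z.sum()

def soft_q_lambda_episode(env, Q, behaviour, tau=0.1, gamma=0.99,
                          lam=0.9, alpha=0.1):
    """Run one episode of a tabular soft Q(lambda)-style update.

    `behaviour(s)` may be any behaviour policy; off-policyness is handled
    by cutting traces with the target policy's probability of the taken
    action, as in Tree Backup(lambda). Assumes env.reset() -> int state
    and env.step(a) -> (next_state, reward, done).
    """
    E = np.zeros_like(Q)                      # eligibility traces
    s = env.reset()
    done = False
    while not done:
        a = behaviour(s)
        s2, r, done = env.step(a)
        # Decay all traces, cut by pi(a|s) of the action actually taken,
        # then bump the trace for the visited state-action pair.
        E *= gamma * lam * boltzmann(Q[s], tau)[a]
        E[s, a] += 1.0
        # Soft TD error: bootstrap on the soft (log-sum-exp) value of s'.
        target = r + (0.0 if done else gamma * soft_value(Q[s2], tau))
        Q += alpha * (target - Q[s, a]) * E
        s = s2
    return Q
```

Cutting traces by the target policy's action probability, rather than by importance-sampling ratios, is what keeps the update stable under arbitrary behaviour policies; this is the usual appeal of tree-backup-style corrections and, per the abstract, the motivation for the Soft Tree Backup operator.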