Abstract
In reinforcement learning, an agent interacts sequentially with an environment to maximize its cumulative reward, receiving only partial, probabilistic feedback. This creates a fundamental exploration-exploitation trade-off: the agent must explore to learn the hidden dynamics while exploiting this knowledge to maximize its target objective. While this framework has been studied extensively in the classical setting, applying it to quantum systems requires dealing with hidden quantum states that evolve under unknown dynamics. We formalize this problem via a framework in which the environment maintains a hidden quantum memory evolving under unknown quantum channels, and the agent intervenes sequentially using quantum instruments. For this setting, we adapt an optimistic maximum-likelihood estimation algorithm. We extend the analysis to continuous action spaces, allowing us to model general positive operator-valued measures (POVMs). By controlling the propagation of estimation errors through quantum channels and instruments, we prove that the cumulative regret of our strategy scales as $\widetilde{\mathcal{O}}(\sqrt{K})$ over $K$ episodes. Furthermore, via a reduction to the multi-armed quantum bandit problem, we establish information-theoretic lower bounds demonstrating that this sublinear scaling is optimal up to polylogarithmic factors. As a physical application, we consider state-agnostic work extraction. When extracting free energy from a sequence of non-i.i.d. quantum states correlated through a hidden memory, any lack of knowledge about the source leads to thermodynamic dissipation; in our setting, the regret exactly quantifies this cumulative dissipation. Using our adaptive algorithm, the agent uses past energy outcomes to improve its extraction protocol on the fly, achieving sublinear cumulative dissipation and, consequently, a dissipation rate that vanishes asymptotically.
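For concreteness, a minimal sketch of the regret quantity the bound above refers to, under the standard episodic convention; the symbols $V^{\star}$ (optimal expected per-episode return) and $\pi_k$ (strategy played in episode $k$) are our illustrative notation, not fixed by the abstract:
\[
\mathrm{Regret}(K) \;=\; \sum_{k=1}^{K} \bigl( V^{\star} - V^{\pi_k} \bigr) \;=\; \widetilde{\mathcal{O}}\bigl(\sqrt{K}\bigr),
\qquad \text{so that} \qquad
\frac{1}{K}\,\mathrm{Regret}(K) \;=\; \widetilde{\mathcal{O}}\bigl(K^{-1/2}\bigr) \;\longrightarrow\; 0 .
\]
Sublinear growth in $K$ thus forces the per-episode average regret, and hence the dissipation rate in the work-extraction application, to vanish asymptotically.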