Automatic feature identification in least-squares policy iteration using the Koopman operator framework

arXiv cs.LG / 3/30/2026


Key Points

  • The paper introduces KAE-LSPI, a reinforcement learning method that combines Koopman autoencoders with least-squares policy iteration by reformulating the least-squares fixed-point approximation in terms of extended dynamic mode decomposition (EDMD); a minimal sketch of this shared fixed-point step appears after this list.
  • It aims to address a key limitation of linear RL approaches, the lack of a systematic way to choose features or kernels, by learning features automatically through the Koopman autoencoder (KAE) framework.
  • The authors benchmark KAE-LSPI against classical LSPI and kernel-based LSPI (KLSPI) using stochastic chain walk and inverted pendulum control tasks.
  • Results indicate that KAE-LSPI learns a reasonable number of features and converges to optimal or near-optimal policies on par with the fixed-feature/kernel baselines, without any features being predefined.
  • The contribution is positioned as a unifying, Koopman-operator-based route to automated feature learning for least-squares RL control.
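
For orientation, the step that all three algorithms share is the least-squares fixed-point (LSTD-Q) solve; they differ only in where the feature matrix comes from. The sketch below is illustrative, not the authors' code: classical LSPI would fill `phi` with hand-designed basis functions, KLSPI with kernel evaluations, and KAE-LSPI with the autoencoder's learned latent map. All names and defaults here are assumptions.

```python
import numpy as np

def lstdq_weights(phi, phi_next, rewards, gamma=0.95, reg=1e-6):
    """Least-squares fixed-point solve for linear Q-function weights.

    phi      : (n, k) features of sampled (state, action) pairs
    phi_next : (n, k) features of the successor pairs under the greedy policy
    rewards  : (n,)   observed rewards

    Solves A w = b with A = phi^T (phi - gamma * phi_next) and b = phi^T r,
    the fixed point of the projected Bellman equation.
    """
    A = phi.T @ (phi - gamma * phi_next)
    b = phi.T @ rewards
    # A small ridge term keeps A invertible when features are nearly collinear.
    return np.linalg.solve(A + reg * np.eye(A.shape[0]), b)
```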

Abstract

In this paper, we present a Koopman autoencoder-based least-squares policy iteration (KAE-LSPI) algorithm for reinforcement learning (RL). The KAE-LSPI algorithm reformulates the so-called least-squares fixed-point approximation method in terms of extended dynamic mode decomposition (EDMD), thereby enabling automatic feature learning via the Koopman autoencoder (KAE) framework. The approach is motivated by the lack of a systematic choice of features or kernels in linear RL techniques. We compare the KAE-LSPI algorithm with two prior methods, classical least-squares policy iteration (LSPI) and kernel-based least-squares policy iteration (KLSPI), on stochastic chain walk and inverted pendulum control problems. Unlike those methods, our approach requires no features or kernels to be fixed a priori. Empirical results show that the number of features learned by the KAE technique remains reasonable compared with the number fixed in the classical LSPI algorithm, and that convergence to an optimal or near-optimal policy is comparable to that of the other two methods.
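
To make the feature-learning side concrete, the following is a minimal sketch of a Koopman autoencoder under conventions common in the KAE literature: an encoder that plays the role of the EDMD dictionary, a linear map K that advances the latent state, and a decoder for reconstruction. The architecture, loss terms, and names are assumptions for illustration, not the paper's model; in KAE-LSPI the trained encoder would supply the features consumed by the least-squares solve sketched above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KoopmanAutoencoder(nn.Module):
    """Encoder -> linear latent dynamics -> decoder.

    The encoder acts as the EDMD dictionary: latent coordinates
    are trained to evolve linearly under the matrix K.
    """
    def __init__(self, state_dim, latent_dim, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, state_dim),
        )
        # Finite-dimensional approximation of the Koopman operator.
        self.K = nn.Linear(latent_dim, latent_dim, bias=False)

    def loss(self, x, x_next):
        z, z_next = self.encoder(x), self.encoder(x_next)
        # Reconstruction: the latent code must retain state information.
        recon = F.mse_loss(self.decoder(z), x)
        # Linearity: one step in latent space is a single matrix multiply.
        linear = F.mse_loss(self.K(z), z_next)
        return recon + linear
```

Training would minimize `loss` over sampled transitions (x, x_next); afterwards, `encoder(x)` provides the learned feature vector for each state.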