On-line Learning in Tree MDPs by Treating Policies as Bandit Arms

arXiv cs.AI / 5/7/2026


Key Points

  • The paper studies online learning for finite-horizon Tree MDPs (T-MDPs) under both PAC (sample complexity) and regret-minimization settings.
  • It shows that standard bandit methods (LUCB and UCB) can be used for T-MDPs by modeling each policy as a bandit arm.
  • A key hurdle is that the number of policies grows exponentially with the number of states, but the authors introduce confidence bounds that exploit data shared across policies.
  • These shared-data confidence bounds let the bandit algorithms run with polynomial memory and per-step computation while still providing instance-dependent bounds on both sample complexity and regret; a toy sketch of the idea follows this list.
  • Empirical results indicate the proposed approach outperforms existing alternatives on hidden-information games.
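
To make the shared-statistics idea concrete, here is a minimal, self-contained Python sketch, not the paper's algorithm: a UCB-style learner on a toy binary tree MDP. Every node keeps visit counts shared by all policies passing through it, and the most optimistic policy is recovered by backward induction, so the learner never enumerates the exponentially many policies. The horizon H, the Bernoulli leaf rewards, and the exploration bonus are illustrative assumptions.

```python
import math
import random

# Toy sketch (not the paper's exact algorithm): UCB over the exponentially
# many policies of a depth-H binary tree MDP, using statistics shared across
# policies. Each node stores a visit count; each leaf (terminal state) also
# stores a reward sum. The most optimistic policy is recovered by backward
# induction, in time and memory polynomial in the number of states.

H = 3                          # horizon: leaves sit at depth H (assumption)
FIRST_LEAF = 2 ** H - 1        # nodes 0..FIRST_LEAF-1 are internal
N_NODES = 2 ** (H + 1) - 1     # complete binary tree, array-indexed

random.seed(0)
# Hypothetical environment: each terminal state pays a Bernoulli reward.
leaf_mean = {z: random.random() for z in range(FIRST_LEAF, N_NODES)}

counts = [0] * N_NODES         # node visits, shared by every policy through it
sums = [0.0] * N_NODES         # cumulative reward recorded at leaves

def ucb_value(node: int, t: int) -> float:
    """Optimistic value of the best policy rooted at `node` (backward induction)."""
    if counts[node] == 0:
        return float("inf")    # unvisited subtree: explore it first
    if node >= FIRST_LEAF:     # terminal state: standard UCB index
        mean = sums[node] / counts[node]
        return mean + math.sqrt(2.0 * math.log(t) / counts[node])
    left, right = 2 * node + 1, 2 * node + 2
    return max(ucb_value(left, t), ucb_value(right, t))

for t in range(1, 2001):
    node, path = 0, [0]
    while node < FIRST_LEAF:   # follow the optimistic policy to a leaf
        left, right = 2 * node + 1, 2 * node + 2
        node = left if ucb_value(left, t) >= ucb_value(right, t) else right
        path.append(node)
    reward = 1.0 if random.random() < leaf_mean[node] else 0.0
    for v in path:             # one trajectory updates every node on its path
        counts[v] += 1
    sums[node] += reward       # reward is observed at the terminal state

best = max(leaf_mean, key=leaf_mean.get)
print(f"best terminal state: {best} "
      f"(mean {leaf_mean[best]:.2f}), visited {counts[best]} of 2000 times")
```

Note the design point this sketch illustrates: memory is one counter per state rather than one per policy, and each step costs one backward-induction pass over the tree, matching the "polynomial memory and per-step computation" claim in spirit.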

Abstract

A Tree Markov Decision Problem (T-MDP) is a finite-horizon MDP with a starting state s₁ in which every state is reachable from s₁ through exactly one state-action trajectory. T-MDPs arise naturally as abstractions of decision making in sequential games with perfect recall, played against stationary opponents. We consider the problem of on-line learning in T-MDPs, in both the PAC and the regret-minimisation regimes. We show that well-known bandit algorithms, LUCB and UCB, can be applied to T-MDPs by treating each policy as an arm. The apparent technical challenge in this approach is that the number of policies is exponential in the number of states. Our main innovation is the design of confidence bounds based on data shared across policies, so that the bandit algorithms can nevertheless be implemented with polynomial memory and per-step computation. We obtain instance-dependent upper bounds on sample complexity and regret that sum a "gap term" from every terminal state, rather than from every policy. Empirically, our algorithms consistently outperform available alternatives on a suite of hidden-information games.
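
For intuition about what "a gap term from every terminal state, rather than every policy" buys, the LaTeX snippet below shows the generic shape such instance-dependent regret bounds usually take. The symbols (the terminal-state set 𝒵, the per-state gaps Δ_z, the per-policy gaps Δ_π) and the exact form are illustrative assumptions, not the paper's statement.

```latex
% Illustrative shape only, not the paper's exact bound: a logarithmic gap
% term summed over terminal states z (set \mathcal{Z}, gaps \Delta_z),
% versus the naive policies-as-arms bound summed over the exponentially
% many policies \pi (gaps \Delta_\pi).
\[
  \mathrm{Regret}(T)
    \;=\; O\!\left(\sum_{z \in \mathcal{Z}\,:\,\Delta_z > 0}
                   \frac{\log T}{\Delta_z}\right)
  \quad\text{versus}\quad
  O\!\left(\sum_{\pi\,:\,\Delta_\pi > 0} \frac{\log T}{\Delta_\pi}\right).
\]
```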