AI Navigate

A Reduction Algorithm for Markovian Contextual Linear Bandits

arXiv cs.LG / 3/16/2026

📰 News · Models & Research

Key Points

  • The paper generalizes the reduction framework from i.i.d. contexts to Markovian contextual linear bandits by introducing a stationary surrogate action set and a delayed-update scheme to control bias from nonstationary context distributions.
  • The paper proves high-probability regret bounds that match those of the underlying linear bandit oracle, with only lower-order dependence on the Markov chain's mixing time under uniform geometric ergodicity.
  • It offers a phased algorithm for unknown transition distributions that learns the surrogate mapping online, enabling practical deployment without full model knowledge.
  • By enabling the use of standard linear bandit techniques under Markovian contexts, the work leverages mature analyses for misspecification and adversarial corruption to improve finite-time guarantees.
  • The results have relevance for applications where context availability is temporally correlated, expanding the applicability of contextual bandits to more realistic, non-i.i.d. settings.

Abstract

Recent work shows that when contexts are drawn i.i.d., linear contextual bandits can be reduced to single-context linear bandits. This "contexts are cheap" perspective is highly advantageous, as it allows for sharper finite-time analyses and leverages mature techniques from the linear bandit literature, such as those for misspecification and adversarial corruption. Motivated by applications with temporally correlated availability, we extend this perspective to Markovian contextual linear bandits, where the action set evolves via an exogenous Markov chain. Our main contribution is a reduction that applies under uniform geometric ergodicity. We construct a stationary surrogate action set to solve the problem using a standard linear bandit oracle, employing a delayed-update scheme to control the bias induced by the nonstationary conditional context distributions. We further provide a phased algorithm for unknown transition distributions that learns the surrogate mapping online. In both settings, we obtain a high-probability worst-case regret bound matching that of the underlying linear bandit oracle, with only lower-order dependence on the mixing time.
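To make the setting concrete, here is a minimal, self-contained sketch of the two ingredients the abstract highlights: an action set driven by an exogenous Markov chain, and a LinUCB-style linear bandit oracle whose regression statistics are refreshed only at block boundaries (a delayed-update scheme). This is illustrative only, not the paper's construction; the transition matrix `P`, the per-state action sets, the block length `B`, and the exploration bonus are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Toy problem (illustrative; not the paper's exact setup) ---
# Exogenous 2-state Markov chain; each state exposes a different action set.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])                 # transition matrix (ergodic)
action_sets = {
    0: np.array([[1.0, 0.0], [0.6, 0.8]]),  # actions available in state 0
    1: np.array([[0.0, 1.0], [0.8, 0.6]]),  # actions available in state 1
}
theta_star = np.array([0.7, 0.3])          # unknown reward parameter
T, d, B = 2000, 2, 50                       # horizon, dimension, block length

# --- LinUCB oracle with a delayed-update scheme ---
# Ridge-regression statistics are frozen within each block of B rounds, so
# decisions inside a block use a fixed estimate; this mimics the blocking
# idea used to control bias from temporally correlated contexts.
A = np.eye(d)                               # regularized Gram matrix
b = np.zeros(d)
A_frozen, b_frozen = A.copy(), b.copy()
state, regret = 0, 0.0
for t in range(T):
    if t % B == 0:                          # delayed update at block starts
        A_frozen, b_frozen = A.copy(), b.copy()
    theta_hat = np.linalg.solve(A_frozen, b_frozen)
    Ainv = np.linalg.inv(A_frozen)
    X = action_sets[state]
    # Optimistic index: estimated reward plus an exploration bonus.
    ucb = X @ theta_hat + 0.5 * np.sqrt(np.einsum('ij,jk,ik->i', X, Ainv, X))
    x = X[np.argmax(ucb)]
    reward = x @ theta_star + 0.05 * rng.standard_normal()
    A += np.outer(x, x)                     # statistics accumulate every round
    b += reward * x
    regret += np.max(X @ theta_star) - x @ theta_star
    state = rng.choice(2, p=P[state])       # exogenous context evolution
```

Because the chain mixes quickly here, the frozen-estimate blocks cost little: the cumulative regret stays far below linear in `T`, which is the qualitative behavior the paper's reduction guarantees (with the mixing time entering only in lower-order terms).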