A Direct Approach for Handling Contextual Bandits with Latent State Dynamics

arXiv stat.ML / 4/10/2026

Key Points

  • The paper revisits the finite-armed linear contextual bandit model of Nelson et al. (2022), in which contexts and rewards evolve according to a finite-state hidden Markov model (HMM).
  • It critiques prior work that reduced the problem to linear contextual bandits by using a simplification where rewards depend on posterior state probabilities rather than directly on hidden states.
  • The authors propose a “direct approach” that models rewards as depending on the hidden states themselves (in addition to the observed contexts), which aligns more closely with the natural formulation of contextual bandits; see the sketch after this list.
  • They develop a fully adaptive strategy that estimates the HMM parameters online and prove stronger high-probability regret bounds.
  • The resulting regret bounds do not depend on the reward functions and depend on the model only through the estimation of the HMM parameters, improving over earlier analyses that gave only expected-regret bounds with more complicated model dependencies (such as reward gaps).

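To make the “direct” model concrete, here is a minimal Python sketch: a finite hidden Markov chain drives both the contexts and the rewards, and each arm's expected reward depends on the hidden state itself, on top of the observed context. All sizes, parameters, and variable names below are hypothetical illustrations, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (illustration only, not from the paper):
# 2 hidden states, 3 arms, 2-dimensional contexts.
n_states, n_arms, d = 2, 3, 2

# HMM transition matrix over the hidden states (rows sum to 1).
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])

# Contexts are emitted by the hidden chain: one mean per state.
context_means = rng.normal(size=(n_states, d))

# "Direct" model: one reward vector per (hidden state, arm) pair,
# so rewards depend on the hidden state itself, not just the context.
theta = rng.normal(size=(n_states, n_arms, d))

state = 0
for t in range(5):
    state = rng.choice(n_states, p=P[state])           # latent dynamics
    context = context_means[state] + 0.1 * rng.standard_normal(d)
    mean_rewards = theta[state] @ context              # state-dependent rewards
    arm = int(np.argmax(mean_rewards))                 # oracle benchmark play
    reward = mean_rewards[arm] + 0.1 * rng.standard_normal()
    print(f"t={t} state={state} arm={arm} reward={reward:+.3f}")
```

The learner never observes `state`, which is what makes the problem harder than a standard linear contextual bandit and motivates estimating the HMM parameters online.
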
Abstract

We revisit the finite-armed linear bandit model of Nelson et al. (2022), where contexts and rewards are governed by a finite hidden Markov chain. Nelson et al. (2022) approach this model via a reduction to linear contextual bandits; but to do so, they actually introduce a simplification in which rewards are linear functions of the posterior probabilities over the hidden states given the observed contexts, rather than functions of the hidden states themselves. Their analysis (but not their algorithm) also does not take into account the estimation of the HMM parameters, and only tackles expected, not high-probability, bounds, which moreover suffer from unnecessarily complex dependencies on the model (like reward gaps). We instead study the more natural model incorporating direct dependencies on the hidden states (on top of dependencies on the observed contexts, as is natural for contextual bandits) and also obtain stronger, high-probability regret bounds for a fully adaptive strategy that estimates the HMM parameters online. These bounds do not depend on the reward functions and depend on the model only through the estimation of the HMM parameters.
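For contrast, the simplification criticized in the abstract makes each arm's expected reward linear in the posterior (belief) over hidden states given the observed contexts. A minimal sketch of that reduction, assuming known HMM parameters and scalar Gaussian context emissions (both purely illustrative choices, not from the paper):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical parameters (illustration only): 2 hidden states,
# scalar Gaussian context emissions, known HMM parameters.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
emission_means = np.array([-1.0, 1.0])

def update_belief(belief, context):
    """One HMM forward-filtering step: predict, then reweight."""
    predicted = P.T @ belief                            # after transition
    likelihood = norm.pdf(context, loc=emission_means, scale=1.0)
    posterior = predicted * likelihood
    return posterior / posterior.sum()

# In the simplified model, each arm's expected reward is linear in
# the belief vector; w is a hypothetical per-arm weight matrix.
w = np.array([[0.5, -0.2],
              [-0.3, 0.8]])

belief = np.array([0.5, 0.5])
for context in (-0.8, 1.2, 0.9):
    belief = update_belief(belief, context)
    mean_rewards = w @ belief                           # linear in the belief
    print(f"belief={belief.round(3)} rewards={mean_rewards.round(3)}")
```

Because the belief is a function of the observed contexts (given the HMM parameters), this reduction turns the problem into a standard linear contextual bandit; the paper argues that letting rewards depend on the hidden states directly is the more natural formulation and yields the cleaner, high-probability regret bounds described above.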