Calibration-Gated LLM Pseudo-Observations for Online Contextual Bandits

arXiv cs.LG, April 17, 2026


Key Points

  • The paper introduces a method that augments Disjoint LinUCB for contextual bandits by adding LLM-generated counterfactual (unplayed-arm) reward pseudo-observations after each round to reduce cold-start regret.
  • It uses a calibration-gated decay schedule that dynamically down-weights LLM influence when the model’s prediction accuracy on played arms is poor, improving robustness early in training.
  • Experiments on UCI Mushroom and MIND-small show that with a task-specific prompt, LLM pseudo-observations cut cumulative regret by 19% on MIND versus plain LinUCB.
  • The study finds that using generic counterfactual prompt framing can increase regret on both environments, indicating prompt design is more critical than the decay schedule or calibration-gating hyperparameters.
  • It analyzes calibration-gating failure modes and provides a theoretical rationale for a bias–variance trade-off that governs how much weight to give pseudo-observations.
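The calibration-gated decay described above can be sketched in a few lines: an exponential moving average tracks the LLM's prediction error on played arms, and the pseudo-observation weight is a time decay multiplied by a gate that shrinks as that error grows. The hyperparameters `beta`, `tau`, and `kappa` below are illustrative choices, not values from the paper.

```python
import math


class CalibrationGate:
    """Sketch of a calibration-gated decay weight for LLM pseudo-observations.

    beta, tau, and kappa are assumed hyperparameters for illustration,
    not the paper's settings.
    """

    def __init__(self, beta: float = 0.9, tau: float = 0.2, kappa: float = 0.01):
        self.beta = beta        # EMA smoothing factor for calibration error
        self.tau = tau          # error scale beyond which the LLM is fully suppressed
        self.kappa = kappa      # time-decay rate (emphasizes early rounds)
        self.ema_error = 0.0    # running calibration error on played arms
        self.t = 0              # round counter

    def update(self, llm_pred: float, observed_reward: float) -> None:
        """Track the LLM's prediction error on the arm that was actually played."""
        err = abs(llm_pred - observed_reward)
        self.ema_error = self.beta * self.ema_error + (1 - self.beta) * err

    def weight(self) -> float:
        """Pseudo-observation weight: time decay times a calibration gate."""
        self.t += 1
        decay = math.exp(-self.kappa * self.t)            # fades LLM influence over time
        gate = max(0.0, 1.0 - self.ema_error / self.tau)  # suppresses a miscalibrated LLM
        return decay * gate
```

With this shape, an accurate LLM keeps a weight near the pure time-decay value in early rounds, while a streak of bad predictions drives the gate (and hence the injection weight) toward zero.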

Abstract

Contextual bandit algorithms suffer from high regret during cold-start, when the learner has insufficient data to distinguish good arms from bad. We propose augmenting Disjoint LinUCB with LLM pseudo-observations: after each round, a large language model predicts counterfactual rewards for the unplayed arms, and these predictions are injected into the learner as weighted pseudo-observations. The injection weight is controlled by a calibration-gated decay schedule that tracks the LLM's prediction accuracy on played arms via an exponential moving average; high calibration error suppresses the LLM's influence, while accurate predictions receive higher weight during the critical early rounds. We evaluate on two contextual bandit environments, UCI Mushroom (2-arm, asymmetric rewards) and MIND-small (5-arm news recommendation), and find that when equipped with a task-specific prompt, LLM pseudo-observations reduce cumulative regret by 19% on MIND relative to pure LinUCB. However, generic counterfactual prompt framing increases regret on both environments, demonstrating that prompt design is the dominant factor, more important than the choice of decay schedule or calibration-gating parameters. We analyze the failure modes of calibration gating on domains with small prediction errors and provide a theoretical motivation for the bias–variance trade-off governing pseudo-observation weight.
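The injection mechanism from the abstract can be sketched on top of standard Disjoint LinUCB: the played arm gets the usual rank-one update, and each unplayed arm receives the same update scaled by the pseudo-observation weight, with the LLM's counterfactual guess standing in for the reward. `llm_predict` below is a hypothetical callable representing the LLM's counterfactual reward prediction; the paper's prompting details are omitted.

```python
import numpy as np


class PseudoObsLinUCB:
    """Minimal sketch of Disjoint LinUCB with weighted LLM pseudo-observations.

    This is an illustrative reconstruction, not the paper's implementation;
    llm_predict(context, arm) is an assumed interface to the LLM's
    counterfactual reward guess.
    """

    def __init__(self, n_arms: int, dim: int, alpha: float = 1.0):
        self.alpha = alpha                               # UCB exploration parameter
        self.A = [np.eye(dim) for _ in range(n_arms)]    # per-arm design matrices
        self.b = [np.zeros(dim) for _ in range(n_arms)]  # per-arm response vectors

    def select(self, x: np.ndarray) -> int:
        """Pick the arm with the highest upper confidence bound for context x."""
        ucbs = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b
            ucbs.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(ucbs))

    def update(self, arm: int, x: np.ndarray, reward: float,
               llm_predict=None, w: float = 0.0) -> None:
        # Standard Disjoint LinUCB update for the played arm.
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
        # Inject weighted pseudo-observations for every unplayed arm.
        if llm_predict is not None and w > 0.0:
            for a in range(len(self.A)):
                if a == arm:
                    continue
                r_hat = llm_predict(x, a)          # LLM's counterfactual reward guess
                self.A[a] += w * np.outer(x, x)
                self.b[a] += w * r_hat * x
```

In this sketch, `w` would come from the calibration-gated decay schedule each round; setting `w = 0` recovers plain Disjoint LinUCB, which makes the bias–variance trade-off explicit: larger `w` adds information (lower variance) at the cost of the LLM's prediction bias.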