Calibration-Gated LLM Pseudo-Observations for Online Contextual Bandits
arXiv cs.LG · April 17, 2026
Key Points
- The paper introduces a method that augments Disjoint LinUCB for contextual bandits by adding LLM-generated counterfactual (unplayed-arm) reward pseudo-observations after each round to reduce cold-start regret.
- It uses a calibration-gated decay schedule that dynamically down-weights LLM influence when the model’s prediction accuracy on played arms is poor, improving robustness early in training.
- Experiments on UCI Mushroom and MIND-small show that with a task-specific prompt, LLM pseudo-observations cut cumulative regret by 19% on MIND versus plain LinUCB.
- The study finds that generic counterfactual prompt framing can actually increase regret in both environments, indicating that prompt design matters more than the decay schedule or the calibration-gating hyperparameters.
- It analyzes calibration-gating failure modes and provides a theoretical rationale for a bias–variance trade-off that governs how much weight to give pseudo-observations.
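The core mechanism described above can be sketched in a few lines: a disjoint LinUCB learner that, after each real update, adds LLM-predicted rewards for the unplayed arms, weighted by a gate that tracks the LLM's recent accuracy on played arms. This is an illustrative reconstruction, not the authors' code; the class name, the `gate` threshold, and the simple rounding-based accuracy check are all assumptions for the sketch.

```python
import numpy as np

class CalibrationGatedLinUCB:
    """Disjoint LinUCB with calibration-gated LLM pseudo-observations.

    Illustrative sketch of the paper's idea: unplayed arms receive an
    LLM-predicted reward whose weight shrinks (to zero, below a gate)
    when the LLM's accuracy on actually-played arms is poor.
    """

    def __init__(self, n_arms, dim, alpha=1.0, base_weight=0.5, gate=0.6):
        self.alpha = alpha                    # UCB exploration width
        self.base_weight = base_weight        # max pseudo-observation weight (assumed)
        self.gate = gate                      # calibration threshold (assumed)
        self.A = [np.eye(dim) for _ in range(n_arms)]    # per-arm design matrices
        self.b = [np.zeros(dim) for _ in range(n_arms)]  # per-arm reward vectors
        self.hits = 0                         # LLM correct on played arms
        self.total = 0                        # LLM predictions checked

    def select(self, x):
        """Pick the arm with the highest upper confidence bound for context x."""
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b
            scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))

    def update(self, x, arm, reward, llm_preds):
        """Real update on the played arm, gated pseudo-updates on the rest.

        llm_preds[a] is a (hypothetical) LLM reward guess for arm a.
        Returns the pseudo-observation weight used this round.
        """
        # Standard disjoint LinUCB update on the played arm.
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x

        # Track LLM calibration on the arm we actually observed.
        self.total += 1
        self.hits += int(round(llm_preds[arm]) == round(reward))
        acc = self.hits / self.total

        # Calibration gate: no pseudo-observations until accuracy clears
        # the threshold; above it, the weight grows with observed accuracy.
        w = self.base_weight * acc if acc >= self.gate else 0.0
        if w > 0.0:
            for a, pred in enumerate(llm_preds):
                if a == arm:
                    continue
                self.A[a] += w * np.outer(x, x)
                self.b[a] += w * pred * x
        return w
```

The gate makes the bias-variance trade-off explicit: accurate LLM guesses densify the unplayed arms' design matrices and cut cold-start regret, while a miscalibrated LLM is weighted toward zero and the learner degrades gracefully to plain LinUCB.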

