Jump Start or False Start? A Theoretical and Empirical Evaluation of LLM-initialized Bandits

arXiv cs.AI / 4/6/2026


Key Points

  • The paper evaluates how LLM-initialized contextual bandits (CBLI) behave when the synthetic preference data used for warm-starting is corrupted by random noise or label-flipping noise.
  • In aligned domains, warm-starting remains beneficial up to roughly 30% corruption, loses its advantage around 40% corruption, and actively degrades performance past 50%.
  • In systematically misaligned domains, the study finds that LLM-generated priors can increase regret compared with a cold-start bandit even without added noise.
  • The authors provide a theoretical framework that decomposes regret changes due to random label noise versus systematic misalignment, and they derive a sufficient condition for when LLM warm starts provably outperform cold starts.
  • Experiments across multiple conjoint datasets and multiple LLMs show that an estimated alignment signal predicts whether warm-starting will improve or worsen recommendation quality.
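
The 40–50% thresholds above have a simple illustrative explanation under a standard label-flip model (a sketch of the intuition, not the paper's exact decomposition): flipping a ±1 label with probability p shrinks its correlation with the true signal by a factor of (1 − 2p).

```latex
\tilde{y} =
\begin{cases}
  y  & \text{with probability } 1-p \\
 -y  & \text{with probability } p
\end{cases}
\quad\Rightarrow\quad
\mathbb{E}[\tilde{y}\,x] = (1-2p)\,\mathbb{E}[y\,x].
```

At p = 0.5 the synthetic labels carry no information, so a warm start can do no better than a cold start, and for p > 0.5 the sign reverses and the prior points the wrong way. Systematic misalignment instead contributes a bias term that more noise-free data cannot remove, which is consistent with misaligned priors hurting even at p = 0.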

Abstract

Recent advances in Large Language Models (LLMs) offer new opportunities to generate user preference data for warm-starting bandits. Recent studies on contextual bandits with LLM initialization (CBLI) have shown that these synthetic priors can significantly lower early regret. However, these findings assume that LLM-generated choices are reasonably aligned with actual user preferences. In this paper, we systematically examine how LLM-generated preferences perform when random and label-flipping noise is injected into the synthetic training data. In aligned domains, we find that warm-starting remains effective up to 30% corruption, loses its advantage around 40%, and degrades performance beyond 50%. Under systematic misalignment, even without added noise, LLM-generated priors can lead to higher regret than a cold-start bandit. To explain these behaviors, we develop a theoretical analysis that decomposes the effects of random label noise and systematic misalignment on the prior error driving the bandit's regret, and we derive a sufficient condition under which LLM-based warm starts are provably better than a cold-start bandit. We validate these results across multiple conjoint datasets and LLMs, showing that estimated alignment reliably tracks when warm-starting improves or degrades recommendation quality.
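A minimal simulation conveys the qualitative effect the abstract describes. This is an illustrative sketch with made-up parameters, not the paper's CBLI setup: a greedy ridge-regression bandit is warm-started with synthetic ±1 pseudo-labels derived from the true preference vector, corrupted by label flips at varying rates, and its cumulative regret is compared against a cold start.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, n_arms, n_synth = 8, 500, 5, 200

# Unknown "true" user preference vector the bandit must learn.
w_true = rng.normal(size=d)
w_true /= np.linalg.norm(w_true)

def make_synthetic(flip_rate):
    """Synthetic preference data standing in for LLM-generated choices:
    binary labels from the true model, with a fraction flipped."""
    X = rng.normal(size=(n_synth, d))
    y = (X @ w_true > 0).astype(float)
    flips = rng.random(n_synth) < flip_rate
    y[flips] = 1.0 - y[flips]
    return X, 2.0 * y - 1.0  # map {0,1} -> {-1,+1} pseudo-rewards

def run_bandit(X_warm=None, y_warm=None, lam=1.0):
    """Greedy ridge-regression bandit; warm data pre-loads the
    sufficient statistics (A, b) before any real interaction."""
    A = lam * np.eye(d)
    b = np.zeros(d)
    if X_warm is not None:
        A += X_warm.T @ X_warm
        b += X_warm.T @ y_warm
    regret = 0.0
    for _ in range(T):
        arms = rng.normal(size=(n_arms, d))   # fresh candidate items
        w_hat = np.linalg.solve(A, b)         # ridge estimate of w_true
        choice = int(np.argmax(arms @ w_hat))
        best = int(np.argmax(arms @ w_true))
        regret += arms[best] @ w_true - arms[choice] @ w_true
        x = arms[choice]
        r = x @ w_true + 0.1 * rng.normal()   # noisy observed reward
        A += np.outer(x, x)
        b += r * x
    return regret

cold = run_bandit()
results = {}
for p in (0.0, 0.3, 0.5):
    Xw, yw = make_synthetic(p)
    results[p] = run_bandit(Xw, yw)
    print(f"flip {p:.0%}: warm regret {results[p]:7.2f}  vs cold {cold:7.2f}")
```

The warm start helps because its prior estimate already points toward w_true; as the flip rate approaches 50%, the pre-loaded statistics add inertia without signal, so learning is slower than starting from scratch.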
