On Pareto Optimality for Parametric Choice Bandits

arXiv stat.ML, April 27, 2026


Key Points

  • The paper studies online assortment optimization under stochastic choice where the goal is not only to maximize cumulative revenue but also to improve the quality of post-hoc inference on revenue contrasts.
  • It analyzes a forced-exploration optimism-in-the-face-of-uncertainty (OFU) method that uses two regularized maximum-likelihood estimators: one for sequential decision-making using all data and another for inference using only exploration-round data.
  • The authors develop a general theoretical framework (score proxies and curvature domination) that yields a self-normalized concentration inequality, a likelihood-based ellipsoidal confidence-set theorem, and a regret bound that explicitly includes optimization error.
  • For the multinomial logit (MNL) model, they provide explicit score/curvature proxies and show that a balanced spaced singleton-exploration schedule achieves regret \tilde{O}(n_T + T/\sqrt{n_T}) and revenue-contrast error \tilde{O}(1/\sqrt{n_T}), with matching lower bounds in a hard two-assortment subclass.
  • The paper identifies a Pareto-undominated exploration regime: when n_T scales as T^\alpha, the regret rate is \tilde{O}(T^{\max\{\alpha,1-\alpha/2\}}) and the inference rate is \tilde{O}(T^{-\alpha/2}), so \alpha\in[2/3,1) is the rate-wise Pareto-undominated interval and \alpha=2/3 is the unique balancing point that minimizes the regret exponent; it also provides verifiable sufficient conditions for other models (Exponomial Choice and Nested Logit).
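The MNL quantities behind these bullets are standard. As a minimal sketch (with hypothetical utilities `v`, prices `r`, and assortment `S` chosen for illustration, not taken from the paper), the choice probability of item i in assortment S is exp(v_i) / (1 + \sum_{j \in S} exp(v_j)), with the "1" accounting for the no-purchase option, and the expected revenue of S is the price-weighted sum of these probabilities:

```python
import numpy as np

def mnl_choice_probs(v, S):
    """MNL purchase probabilities for the items in assortment S.

    The denominator's leading 1.0 is the weight of the no-purchase option.
    """
    w = np.exp(v[S])
    return w / (1.0 + w.sum())

def expected_revenue(v, r, S):
    """Expected per-round revenue of offering assortment S under MNL choice."""
    return float(np.dot(r[S], mnl_choice_probs(v, S)))

# Hypothetical example: 3 items, offer assortment {0, 2}.
v = np.array([0.5, 0.2, -0.1])   # mean utilities (illustrative)
r = np.array([1.0, 0.8, 1.2])    # revenues/prices (illustrative)
S = np.array([0, 2])

print(expected_revenue(v, r, S))
```

The purchase probabilities of the offered items sum to strictly less than one, since the no-purchase option always retains positive probability.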

Abstract

We study online assortment optimization under stochastic choice when a decision maker simultaneously values cumulative revenue performance and the quality of post-hoc inference on revenue contrasts. We analyze a forced-exploration optimism-in-the-face-of-uncertainty (OFU) scheme that combines two regularized maximum-likelihood estimators: one based on all observations for sequential decision making, and one based only on exploration rounds for inference. Our general theory is developed under predictable score proxies and per-round action-dependent curvature domination. Under these conditions we establish a self-normalized concentration inequality, a likelihood-based ellipsoidal confidence-set theorem, and a regret bound for approximate optimistic actions that explicitly accounts for optimization error. For the multinomial logit (MNL) model we derive explicit score and curvature proxies and show that a balanced spaced singleton-exploration schedule yields realized coordinate coverage, implying regret \Otilde(n_T + T/\sqrt{n_T}) and revenue-contrast error \Otilde(1/\sqrt{n_T}) up to fixed problem-dependent factors. A hard two-assortment subclass yields a matching lower bound at the product level. Consequently, within the polynomial exploration family n_T \asymp T^\alpha, the regret and inference rates become \Otilde(T^{\max\{\alpha,1-\alpha/2\}}) and \Otilde(T^{-\alpha/2}), respectively; hence \alpha\in[2/3,1) is the rate-wise Pareto-undominated interval and \alpha=2/3 is the unique balancing point that minimizes the regret exponent. Finally, for the Exponomial Choice and Nested Logit models we state verifiable sufficient conditions that would instantiate the general framework.
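The exponent trade-off at the end of the abstract can be checked numerically. Assuming only the stated rates (regret exponent \max\{\alpha, 1-\alpha/2\}, inference exponent -\alpha/2 for n_T \asymp T^\alpha), a small grid search recovers the balancing point \alpha = 2/3, where \alpha = 1 - \alpha/2:

```python
import numpy as np

def regret_exponent(alpha):
    # Regret scales as T^{max(alpha, 1 - alpha/2)}: forced exploration costs
    # ~T^alpha, while estimation error after n_T ~ T^alpha exploration rounds
    # contributes ~T^{1 - alpha/2}.
    return max(alpha, 1.0 - alpha / 2.0)

def inference_exponent(alpha):
    # Revenue-contrast error scales as T^{-alpha/2}: more exploration
    # (larger alpha) always improves inference.
    return -alpha / 2.0

# Grid search over alpha in (0, 1) for the regret-minimizing exponent.
alphas = np.linspace(0.01, 0.99, 9801)
best = alphas[np.argmin([regret_exponent(a) for a in alphas])]
print(best)  # close to 2/3, where the two regret terms balance
```

For \alpha < 2/3 the estimation term 1-\alpha/2 dominates and decreases in \alpha; for \alpha > 2/3 the exploration term \alpha dominates and increases; inference keeps improving on [2/3, 1), which is why that whole interval is Pareto-undominated while only \alpha = 2/3 minimizes regret.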