Abstract
We study online assortment optimization under stochastic choice when a decision maker simultaneously values cumulative revenue performance and the quality of post-hoc inference on revenue contrasts. We analyze a forced-exploration optimism-in-the-face-of-uncertainty (OFU) scheme that combines two regularized maximum-likelihood estimators: one based on all observations for sequential decision making, and one based only on exploration rounds for inference. Our general theory is developed under predictable score proxies and per-round action-dependent curvature domination. Under these conditions we establish a self-normalized concentration inequality, a likelihood-based ellipsoidal confidence-set theorem, and a regret bound for approximate optimistic actions that explicitly accounts for optimization error. For the multinomial logit (MNL) model we derive explicit score and curvature proxies and show that a balanced spaced singleton-exploration schedule yields realized coordinate coverage, implying regret \Otilde(n_T + T/\sqrt{n_T}) and revenue-contrast error \Otilde(1/\sqrt{n_T}) up to fixed problem-dependent factors. A hard two-assortment subclass yields a matching lower bound at the product level. Consequently, within the polynomial exploration family n_T \asymp T^\alpha, the regret and inference rates become \Otilde(T^{\max\{\alpha,1-\alpha/2\}}) and \Otilde(T^{-\alpha/2}), respectively; hence \alpha\in[2/3,1) is the rate-wise Pareto-undominated interval and \alpha=2/3 is the unique balancing point that minimizes the regret exponent. Finally, for the Exponomial Choice and Nested Logit models we state verifiable sufficient conditions that would instantiate the general framework.