Adaptive Simulation Experiment for LLM Policy Optimization

arXiv cs.LG / April 13, 2026


Key Points

  • The paper proposes treating large language models as stochastic simulators in order to select, from a finite candidate set, the deployment policy that best governs response quality and user experience.
  • It introduces a pairwise-comparison-based adaptive simulation experiment framework and studies two policy spaces: an unstructured (non-parametric) space and a structured space generated from a preference model.
  • The authors derive the fundamental data requirements for high-probability identification of the optimal policy in both settings, including closed-form optimal sampling proportions for the unstructured case.
  • For the structured setting, they provide a regularized convex optimization formulation to compute optimal sampling proportions.
  • The proposed adaptive procedure, LLM-PO, carries theoretical guarantees, and numerical results show that it outperforms benchmark methods and improves LLM performance.
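To make the pairwise-comparison setup concrete, here is a minimal sketch of the unstructured setting: each candidate policy is treated as a stochastic simulator, and the experimenter estimates head-to-head win probabilities from sampled comparisons. This is an illustration, not the paper's procedure; the Bradley-Terry outcome model and the quality scores are assumptions made purely for the demo.

```python
import random

def simulate_comparison(q_i, q_j, rng):
    """One pairwise comparison between policies i and j.

    Outcome follows a Bradley-Terry-style model (an illustrative
    assumption, not the paper's simulator): i wins with
    probability q_i / (q_i + q_j).
    """
    return 1 if rng.random() < q_i / (q_i + q_j) else 0

def uniform_pairwise_experiment(qualities, budget, seed=0):
    """Static (non-adaptive) baseline: spread the comparison budget
    uniformly over all ordered pairs and return the index of the
    policy with the highest empirical win rate."""
    rng = random.Random(seed)
    k = len(qualities)
    wins = [0] * k
    trials = [0] * k
    pairs = [(i, j) for i in range(k) for j in range(k) if i != j]
    for t in range(budget):
        i, j = pairs[t % len(pairs)]
        w = simulate_comparison(qualities[i], qualities[j], rng)
        wins[i] += w
        wins[j] += 1 - w
        trials[i] += 1
        trials[j] += 1
    rates = [wins[i] / trials[i] for i in range(k)]
    return max(range(k), key=lambda i: rates[i])
```

This uniform allocation is exactly the kind of baseline the paper's optimal sampling proportions improve on: spending equal effort on every pair wastes comparisons on policies that are already clearly separated.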

Abstract

Large language models (LLMs) have significant potential to improve operational efficiency in operations management. Deploying these models requires specifying a policy that governs response quality, shapes user experience, and influences operational value. In this research, we treat LLMs as stochastic simulators and propose a pairwise comparison-based adaptive simulation experiment framework for identifying the optimal policy from a finite set of candidates. We consider two policy spaces: an unstructured space with no parametric assumption, and a structured space in which the data are generated from a preference model. For both settings, we characterize the fundamental data requirements for identifying the optimal policy with high probability. In the unstructured case, we derive a closed-form expression for the optimal sampling proportions, together with a clear operational interpretation. In the structured case, we formulate a regularized convex program to compute the optimal proportions. We then develop an adaptive experimental procedure, termed LLM-PO, for both policy spaces, and prove that it identifies the optimal policy with the desired statistical guarantee while asymptotically attaining the fundamental data requirements. Numerical experiments demonstrate that LLM-PO consistently outperforms benchmark methods and improves LLM performance.
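The adaptive idea in the abstract can be loosely illustrated with a generic successive-elimination procedure over pairwise comparisons. To be clear, this is a textbook best-policy-identification sketch under an assumed Bradley-Terry simulator, not LLM-PO itself: the Hoeffding confidence radius below is a standard conservative choice, whereas the paper derives the fundamental data requirements and optimal sampling proportions.

```python
import math
import random

def bt_win(q_i, q_j, rng):
    # Bradley-Terry comparison (illustrative assumption, not the paper's model).
    return rng.random() < q_i / (q_i + q_j)

def successive_elimination(qualities, delta=0.05, max_rounds=5000, seed=0):
    """Adaptive pairwise identification via successive elimination.

    In each round, every surviving pair is compared once; a policy is
    dropped as soon as the Hoeffding upper bound on its head-to-head
    win probability against some rival falls below 1/2."""
    rng = random.Random(seed)
    k = len(qualities)
    active = set(range(k))
    wins = [[0] * k for _ in range(k)]   # wins[i][j]: times i beat j
    n = [[0] * k for _ in range(k)]      # comparisons between i and j
    for r in range(1, max_rounds + 1):
        for i in sorted(active):
            for j in sorted(active):
                if i < j:
                    w = bt_win(qualities[i], qualities[j], rng)
                    wins[i][j] += int(w)
                    wins[j][i] += int(not w)
                    n[i][j] += 1
                    n[j][i] += 1
        # union-bound Hoeffding radius so the total error stays below delta
        rad = math.sqrt(math.log(4 * k * k * r * r / delta) / (2 * r))
        for i in list(active):
            # eliminate i if some rival j is statistically better
            if any(wins[i][j] / n[i][j] + rad < 0.5
                   for j in active if j != i):
                active.discard(i)
        if len(active) == 1:
            break
    # return the survivor (or the empirical leader if the budget ran out)
    return max(active, key=lambda i: sum(wins[i]))
```

Because effort automatically stops flowing to eliminated policies, the comparison budget concentrates on the hard-to-distinguish pairs, which is the intuition behind the paper's optimal sampling proportions, achieved there with sharper, setting-specific allocations.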