Learning to Staff: Offline Reinforcement Learning and Fine-Tuned LLMs for Warehouse Staffing Optimization

arXiv cs.LG · March 27, 2026


Key Points

  • The paper studies machine learning methods to optimize real-time warehouse staffing decisions in semi-automated sortation systems, evaluating trade-offs across decision abstractions.
  • It shows that custom Transformer policies trained with offline reinforcement learning on rich historical state representations can improve simulated throughput by 2.4% versus historical baselines.
  • For higher-level, human-readable decision inputs, the authors test LLM-based approaches, comparing prompting, automatic prompt optimization, and fine-tuning strategies.
  • They find that prompting alone is insufficient, but supervised fine-tuning combined with Direct Preference Optimization (using simulator-generated preference data) can match or slightly exceed historical baselines in a hand-crafted simulator.
  • Overall, the work argues both offline RL (for task-specific architectures) and fine-tuned LLMs (for interpretable state abstraction and preference feedback loops) are viable for AI-assisted operational staffing.
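"Offline" RL here means learning a policy purely from logged historical transitions, with no further environment interaction during training. The paper trains custom Transformer policies on rich state representations; as a much simpler stand-in that illustrates the same data regime, the sketch below runs tabular Q-learning over a fixed dataset of `(state, action, reward, next_state)` tuples. The tiny state space and all names are illustrative assumptions, not the paper's setup.

```python
def offline_q_learning(dataset, n_states, n_actions,
                       alpha=0.1, gamma=0.99, epochs=50):
    """Fit a tabular Q-function from a fixed batch of logged transitions.

    dataset: list of (state, action, reward, next_state) tuples collected
    from historical operations; no new transitions are generated.
    """
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(epochs):
        for s, a, r, s_next in dataset:
            # Standard Q-learning target, but evaluated only on logged data.
            target = r + gamma * max(Q[s_next])
            Q[s][a] += alpha * (target - Q[s][a])
    return Q


# Toy log: from state 0, action 1 earned reward 1; everything else earned 0.
log = [(0, 0, 0.0, 1), (0, 1, 1.0, 1), (1, 0, 0.0, 1), (1, 1, 0.0, 1)]
Q = offline_q_learning(log, n_states=2, n_actions=2)
```

The greedy policy recovered from `Q` prefers the historically rewarding action in state 0, which is the essence of offline RL; the paper's contribution is doing this at scale with Transformer policies over detailed warehouse state.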

Abstract

We investigate machine learning approaches for optimizing real-time staffing decisions in semi-automated warehouse sortation systems. Operational decision-making can be supported at different levels of abstraction, each with different trade-offs. We evaluate two approaches, each in a corresponding simulation environment. First, we train custom Transformer-based policies using offline reinforcement learning on detailed historical state representations, achieving a 2.4% throughput improvement over historical baselines in learned simulators. In high-volume warehouse operations, improvements of this size translate to significant savings. Second, we explore LLMs operating on abstracted, human-readable state descriptions, a natural fit for decisions that warehouse managers make from high-level operational summaries. We systematically compare prompting techniques, automatic prompt optimization, and fine-tuning strategies. While prompting alone proves insufficient, supervised fine-tuning combined with Direct Preference Optimization on simulator-generated preferences matches or slightly exceeds historical baselines in a hand-crafted simulator. Our findings demonstrate that both approaches offer viable paths toward AI-assisted operational decision-making: offline RL excels with task-specific architectures, while LLMs support human-readable inputs and an iterative feedback loop that can incorporate manager preferences.