Distributionally Robust Token Optimization in RLHF

arXiv cs.AI / 4/13/2026

Key Points

  • The paper introduces Distributionally Robust Token Optimization (DRTO), combining token-level RLHF with Distributionally Robust Optimization (DRO) to reduce large failures from small prompt or distribution shifts.
  • DRTO provides theoretical robustness by bounding worst-case token-wise rewards using an f-divergence ambiguity set over a loss minibatch.
  • Experiments on mathematical reasoning benchmarks show improved consistency under distribution shifts, reporting a 9.17% gain on GSM8K and a 2.49% gain on MathQA.
  • The approach targets multi-step reasoning reliability by optimizing at the token level rather than relying only on standard RLHF training signals.
  • The results suggest DRTO-style robust optimization could improve practical LLM performance when real user inputs deviate slightly from training distributions.

Abstract

Large Language Models (LLMs) tend to respond correctly to prompts that align with the data they were trained and fine-tuned on. Yet small shifts in wording, format, or language can trigger surprisingly large failures, especially on multi-step reasoning problems. To address this problem, we propose a Distributionally Robust Token Optimization (DRTO) approach, which combines token-level Reinforcement Learning from Human Feedback (RLHF) with Distributionally Robust Optimization (DRO). DRTO bounds worst-case token-wise rewards by constructing an f-divergence ambiguity set over a loss minibatch, yielding a theoretical robustness guarantee. Empirically, DRTO improves consistency under distribution shifts on mathematical reasoning benchmarks, achieving a 9.17% improvement on GSM8K and a 2.49% improvement on MathQA.
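To make the core idea concrete, here is a minimal sketch of a distributionally robust minibatch loss using a KL-divergence ambiguity set (KL is one common choice of f-divergence; the paper's exact divergence, radius, and token-level weighting are not specified here, so the function name, the radius `rho`, and the toy loss values below are all illustrative assumptions). The worst-case expected loss over all reweightings Q of the minibatch with KL(Q‖P) ≤ ρ has a convex dual that reduces to a one-dimensional minimization:

```python
# Hedged sketch: worst-case minibatch loss under a KL ambiguity set.
# sup_{KL(Q||P) <= rho} E_Q[loss]  =  min_{tau > 0} tau*rho + tau*log E_P[exp(loss/tau)]
# Names and values are illustrative, not from the paper.
import numpy as np
from scipy.optimize import minimize_scalar

def kl_robust_loss(losses, rho=0.1):
    """Worst-case expected loss over {Q : KL(Q || P_batch) <= rho},
    computed via the convex dual (a 1-D minimization over tau > 0)."""
    losses = np.asarray(losses, dtype=float)
    shift = losses.max()  # stabilize the log-mean-exp numerically

    def dual(tau):
        return tau * rho + tau * np.log(np.mean(np.exp((losses - shift) / tau))) + shift

    res = minimize_scalar(dual, bounds=(1e-3, 1e3), method="bounded")
    return res.fun

batch = np.array([0.2, 0.4, 0.3, 2.0])  # toy per-token losses
plain = np.mean(batch)                  # standard (average) risk
robust = kl_robust_loss(batch, rho=0.1) # robust risk, upweights the bad token
```

The robust objective interpolates between the batch average (ρ → 0) and the worst single loss (ρ → ∞), which is how a DRO-style training signal penalizes the rare tokens where the model fails badly rather than only the average case.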