Pareto-Optimal Offline Reinforcement Learning via Smooth Tchebysheff Scalarization

arXiv cs.AI / 4/16/2026


Key Points

  • The paper tackles multi-objective offline reinforcement learning for aligning large language model outputs with conflicting human preferences, which single-objective and linear scalarization methods can’t properly capture.
  • It introduces a smooth Tchebysheff-based scalarization approach to overcome provable failures of linear reward scalarization on non-convex regions of the Pareto front.
  • The authors propose STOMP (Smooth Tchebysheff Optimization of Multi-Objective Preferences), a new offline RL algorithm that extends direct preference optimization to multi-objective settings using standardized rewards from observed data distributions.
  • Experiments align three autoregressive protein language models on three laboratory protein-fitness datasets and show STOMP achieves the highest hypervolume in eight of nine settings under both offline off-policy and generative evaluations.
  • The work positions STOMP as a robust method for multi-attribute post-training alignment that may generalize beyond protein optimization to other multi-criteria domains such as chat safety/helpfulness tradeoffs.

Abstract

Large language models can be aligned with human preferences through offline reinforcement learning (RL) on small labeled datasets. While single-objective alignment is well-studied, many real-world applications demand the simultaneous optimization of multiple conflicting rewards, e.g. optimizing both catalytic activity and specificity in protein engineering, or helpfulness and harmlessness for chatbots. Prior work has largely relied on linear reward scalarization, but this approach provably fails to recover non-convex regions of the Pareto front. In this paper, instead of scalarizing the rewards directly, we frame multi-objective RL itself as an optimization problem to be scalarized via smooth Tchebysheff scalarization, a recent technique that overcomes the shortcomings of linear scalarization. We use this formulation to derive Smooth Tchebysheff Optimization of Multi-Objective Preferences (STOMP), a novel offline RL algorithm that extends direct preference optimization to the multi-objective setting in a principled way by standardizing the individual rewards based on their observed distributions. We empirically validate STOMP on a range of protein engineering tasks by aligning three autoregressive protein language models on three laboratory datasets of protein fitness. Compared to state-of-the-art baselines, STOMP achieves the highest hypervolumes in eight of nine settings according to both offline off-policy and generative evaluations. We thus demonstrate that STOMP is a powerful, robust multi-objective alignment algorithm that can meaningfully improve post-trained models for multi-attribute protein optimization and beyond.
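To make the core idea concrete, here is a minimal sketch contrasting linear scalarization with a smooth Tchebysheff scalarization. This is an illustrative reconstruction, not the paper's STOMP objective: the function names, the ideal point `ideal`, and the smoothing temperature `mu` are assumptions, and STOMP additionally standardizes each reward by its observed data distribution before scalarizing.

```python
import numpy as np

def linear_scalarization(rewards, weights):
    """Weighted sum of objective values (higher is better).

    Provably cannot recover solutions on non-convex regions of the
    Pareto front, which motivates the Tchebysheff alternative below.
    """
    return rewards @ weights

def smooth_tchebysheff(rewards, weights, ideal, mu=0.1):
    """Smooth (log-sum-exp) relaxation of the Tchebysheff scalarization
    (lower is better):

        max_i w_i * (z*_i - r_i)
            ->  mu * log sum_i exp(w_i * (z*_i - r_i) / mu)

    It penalizes the worst weighted gap to an ideal point z*, and as
    mu -> 0 it recovers the hard (non-smooth) max.
    """
    gaps = weights * (ideal - rewards)
    m = gaps.max()                      # shift for numerical stability
    return m + mu * np.log(np.exp((gaps - m) / mu).sum())

# Two conflicting objectives, e.g. activity vs. specificity (toy values).
rewards = np.array([0.3, 0.1])
weights = np.array([0.5, 0.5])
ideal = np.array([1.0, 1.0])            # per-objective best (assumed known)

lin = linear_scalarization(rewards, weights)          # -> 0.2
stch = smooth_tchebysheff(rewards, weights, ideal,
                          mu=1e-3)                    # -> ~0.45 (worst gap)
```

With a tiny `mu` the smooth value essentially equals the worst weighted gap (here 0.5 * (1.0 - 0.1) = 0.45), while a larger `mu` yields a differentiable objective suitable for gradient-based policy optimization.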