ParetoSlider: Diffusion Models Post-Training for Continuous Reward Control

arXiv cs.LG / 4/23/2026

Key Points

The paper argues that typical RL post-training for generative models often uses a single scalar reward, which forces early scalarization into a fixed weighted sum and removes inference-time flexibility for conflicting objectives.
It introduces ParetoSlider, a multi-objective RL framework that trains one diffusion model to approximate the full Pareto front by conditioning on continuously varying preference weights.
This design allows users to select and navigate reward trade-offs during inference without retraining or keeping multiple model checkpoints.
The approach is evaluated using three flow-matching diffusion backbones (SD3.5, FluxKontext, and LTX-2), where the single preference-conditioned model matches or outperforms baselines trained for specific fixed trade-offs.
The key benefit claimed is fine-grained control over competing generative goals (e.g., balancing prompt adherence against source fidelity for image editing) that prior methods lack.
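
To make the contrast between a fixed weighted sum and continuously varying preference weights concrete, here is a minimal NumPy sketch. It is not the paper's implementation: the function names `scalarize` and `sample_preferences`, the flat Dirichlet sampling distribution, and the toy reward values are all assumptions chosen for illustration.

```python
import numpy as np

def scalarize(rewards, weights):
    """Weighted sum of per-objective rewards.

    rewards: array of shape (batch, k), one column per objective.
    weights: shape (k,) for one fixed trade-off, or (batch, k)
             for a separate preference vector per sample.
    """
    rewards = np.asarray(rewards, dtype=float)
    weights = np.broadcast_to(np.asarray(weights, dtype=float), rewards.shape)
    return (rewards * weights).sum(axis=-1)

def sample_preferences(batch_size, num_objectives, rng):
    """Draw one weight vector per sample from the probability simplex.

    A flat Dirichlet is an assumed choice; the summary above does not
    state which distribution the authors sample from.
    """
    return rng.dirichlet(np.ones(num_objectives), size=batch_size)

rng = np.random.default_rng(0)

# Toy rewards for two samples: [prompt adherence, source fidelity].
rewards = np.array([[0.9, 0.2],
                    [0.4, 0.8]])

# Early scalarization: one trade-off, fixed at training time.
fixed = scalarize(rewards, [0.5, 0.5])

# Preference-conditioned variant: each sample gets its own weights,
# which would also be passed to the model as a conditioning signal.
prefs = sample_preferences(batch_size=2, num_objectives=2, rng=rng)
varied = scalarize(rewards, prefs)
```

In the fixed case every sample is scored under the same trade-off, so the trained model can only ever represent that one point. In the sampled case the weights change per sample, which is what would let a model conditioned on them learn a whole family of trade-offs.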

Abstract

Reinforcement Learning (RL) post-training has become the standard for aligning generative models with human preferences, yet most methods rely on a single scalar reward. When multiple criteria matter, the prevailing practice of "early scalarization" collapses rewards into a fixed weighted sum. This commits the model to a single trade-off point at training time, providing no inference-time control over inherently conflicting goals, such as prompt adherence versus source fidelity in image editing. We introduce ParetoSlider, a multi-objective RL (MORL) framework that trains a single diffusion model to approximate the entire Pareto front. By training the model with continuously varying preference weights as a conditioning signal, we enable users to navigate optimal trade-offs at inference time without retraining or maintaining multiple checkpoints. We evaluate ParetoSlider across three state-of-the-art flow-matching backbones: SD3.5, FluxKontext, and LTX-2. Our single preference-conditioned model matches or exceeds the performance of baselines trained separately for fixed reward trade-offs, while uniquely providing fine-grained control over competing generative goals.
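
For readers unfamiliar with the term, the following sketch illustrates what a Pareto front is and how changing the preference weights moves the selected point along it. It is a generic illustration rather than anything taken from the paper: `pareto_front`, `best_for_weights`, and the candidate reward pairs are hypothetical, and higher values are assumed to be better on every objective.

```python
import numpy as np

def pareto_front(points):
    """Return indices of non-dominated points (higher is better)."""
    points = np.asarray(points, dtype=float)
    keep = []
    for i, p in enumerate(points):
        # p is dominated if some point is at least as good on every
        # objective and strictly better on at least one.
        at_least_as_good = np.all(points >= p, axis=1)
        strictly_better = np.any(points > p, axis=1)
        if not np.any(at_least_as_good & strictly_better):
            keep.append(i)
    return keep

def best_for_weights(points, weights):
    """Index of the point that maximizes the weighted sum of objectives."""
    scores = np.asarray(points, dtype=float) @ np.asarray(weights, dtype=float)
    return int(np.argmax(scores))

# Hypothetical (prompt adherence, source fidelity) scores for five outputs.
candidates = [[0.9, 0.2], [0.6, 0.6], [0.2, 0.9], [0.5, 0.5], [0.1, 0.1]]

front = pareto_front(candidates)

# Sliding the weights from one objective to the other selects
# different points, all of which lie on the front.
favor_adherence = best_for_weights(candidates, [1.0, 0.0])
balanced = best_for_weights(candidates, [0.5, 0.5])
favor_fidelity = best_for_weights(candidates, [0.0, 1.0])
```

Here the last two candidates are dominated by `[0.6, 0.6]` and drop out, while the three remaining points each represent a different trade-off that no other candidate strictly improves on.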