FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling

arXiv cs.LG / 2026/4/9

Key Points

The paper addresses a key bottleneck in reinforcement-learning post-training for text-to-image diffusion models: larger rollout group sizes improve alignment, but scaling them on large foundation diffusion models is computationally expensive.
It shows that simply using FP4 quantization in diffusion RL rollouts can degrade performance, creating a trade-off between throughput efficiency and training integrity.
The proposed Sol-RL (“Speed-of-light RL”) uses a two-stage pipeline: high-throughput NVFP4 rollouts to build a large candidate pool and extract a contrastive subset, then BF16 regeneration and policy optimization only on the selected high-fidelity samples.
Experiments on SANA, FLUX.1, and SD3.5-L indicate that Sol-RL maintains the training integrity of a BF16 pipeline while exploiting FP4 arithmetic throughput, improving alignment metrics and accelerating training convergence by up to 4.64× at lower cost.
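
The two-stage idea in the points above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the paper's implementation: `cheap_sample` and `exact_sample` are hypothetical stand-ins for the NVFP4 and BF16 samplers, "contrastive" is assumed here to mean the k highest- and k lowest-reward candidates, and regeneration is assumed to reuse the seed of each selected candidate.

```python
from typing import Any, Callable, List, Sequence, Tuple


def select_contrastive(rewards: Sequence[float], k: int) -> List[int]:
    """Return indices of the k lowest- and k highest-reward candidates.

    Assumption: this top/bottom-k rule is one plausible reading of a
    "contrastive subset"; the source does not spell out the exact rule.
    """
    if k <= 0 or 2 * k > len(rewards):
        raise ValueError("need 0 < 2*k <= number of candidates")
    order = sorted(range(len(rewards)), key=lambda i: rewards[i])
    return order[:k] + order[-k:]


def two_stage_rollout(
    prompt: str,
    seeds: Sequence[int],
    cheap_sample: Callable[[str, int], Any],   # hypothetical low-precision sampler
    exact_sample: Callable[[str, int], Any],   # hypothetical high-precision sampler
    reward_fn: Callable[[Any], float],
    k: int,
) -> List[Tuple[int, Any]]:
    """Explore with the cheap sampler, then regenerate only the selected seeds."""
    # Stage 1: score a large candidate pool produced by the cheap sampler.
    cheap_rewards = [reward_fn(cheap_sample(prompt, s)) for s in seeds]
    chosen = select_contrastive(cheap_rewards, k)
    # Stage 2: regenerate the chosen candidates with the exact sampler,
    # so that only high-fidelity samples reach the policy update.
    return [(seeds[i], exact_sample(prompt, seeds[i])) for i in chosen]


# Toy usage: "images" are just floats and the reward is the identity.
picked = two_stage_rollout(
    prompt="a red cube",
    seeds=[0, 1, 2, 3, 4, 5],
    cheap_sample=lambda p, s: [0.1, 0.9, 0.5, 0.3, 0.7, 0.2][s],
    exact_sample=lambda p, s: float(s),
    reward_fn=lambda x: x,
    k=2,
)
```

In this toy run the pool of six candidates is reduced to four, so the expensive sampler is called four times instead of six; the saving grows with the ratio of pool size to subset size.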

Abstract

Reinforcement-Learning-based post-training has recently emerged as a promising paradigm for aligning text-to-image diffusion models with human preferences. In recent studies, increasing the rollout group size yields pronounced performance improvements, indicating substantial room for further alignment gains. However, scaling rollouts on large-scale foundational diffusion models (e.g., FLUX.1-12B) imposes a heavy computational burden. To alleviate this bottleneck, we explore the integration of FP4 quantization into Diffusion RL rollouts. Yet, we identify that naive quantized pipelines inherently introduce risks of performance degradation. To overcome this dilemma between efficiency and training integrity, we propose Sol-RL (Speed-of-light RL), a novel FP4-empowered Two-stage Reinforcement Learning framework. First, we utilize high-throughput NVFP4 rollouts to generate a massive candidate pool and extract a highly contrastive subset. Second, we regenerate these selected samples in BF16 precision and optimize the policy exclusively on them. By decoupling candidate exploration from policy optimization, Sol-RL integrates the algorithmic mechanisms of rollout scaling with the system-level throughput gains of NVFP4. This synergistic algorithm-hardware design effectively accelerates the rollout phase while reserving high-fidelity samples for optimization. We empirically demonstrate that our framework maintains the training integrity of BF16 precision pipeline while fully exploiting the throughput gains enabled by FP4 arithmetic. Extensive experiments across SANA, FLUX.1, and SD3.5-L substantiate that our approach delivers superior alignment performance across multiple metrics while accelerating training convergence by up to 4.64×, unlocking the power of massive rollout scaling at a fraction of the cost.
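
To see why a highly contrastive subset can matter for the policy update, consider a group-normalized advantage. This is a minimal sketch under an assumption: the abstract does not name the optimizer, so the group-relative normalization below (as used in GRPO-style methods) is an illustrative stand-in, and `group_advantages` is a hypothetical helper.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one prompt's group: (r - mean) / (std + eps).

    Assumption: group-relative normalization is used for illustration only;
    the source does not state the exact advantage estimator.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group whose rewards are all identical carries no learning signal.
flat = group_advantages([0.5, 0.5, 0.5, 0.5])

# A group built from the best and worst candidates separates cleanly.
contrastive = group_advantages([0.1, 0.2, 0.8, 0.9])
```

When every reward in a group is the same, all advantages are zero and the group contributes nothing to the gradient; a subset drawn from the extremes of a large pool avoids spending the expensive BF16 pass on such uninformative samples.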
