Mean Flow Policy Optimization

arXiv cs.LG, April 17, 2026


Key Points

  • The paper introduces MeanFlow Policy Optimization (MFPO), which replaces diffusion-model policy representations with MeanFlow models for online reinforcement learning to cut training and inference overhead.
  • It uses a maximum-entropy RL setup and soft policy iteration to encourage exploration while learning MeanFlow-based policies.
  • MFPO addresses two challenges specific to MeanFlow policies: action likelihood evaluation and soft policy improvement.
  • Experiments on MuJoCo and DeepMind Control Suite show MFPO matches or improves on diffusion-based RL baselines while significantly reducing both training and inference time.
  • The authors provide an open-source implementation of MFPO on GitHub for reproducibility and further experimentation.
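The efficiency claim above rests on MeanFlow models generating an action in one (or few) network calls, where a diffusion policy must iterate a denoising chain. The toy sketch below illustrates that contrast only; the velocity field, function names, and dimensions are illustrative stand-ins, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
ACTION_DIM = 2

def mean_velocity(z, r, t, state):
    # Stand-in for a learned average-velocity network u_theta(z, r, t | s);
    # a fixed linear map here so the example is runnable.
    return -0.5 * z + 0.1 * state

def meanflow_action(state):
    # One-step sampling: the average velocity over [0, 1] is applied at once,
    # so generating an action costs a single network evaluation.
    z1 = rng.normal(size=ACTION_DIM)  # noise sample
    return z1 - mean_velocity(z1, 0.0, 1.0, state)

def diffusion_like_action(state, steps=50):
    # Iterative Euler integration of an instantaneous velocity field,
    # mimicking the per-step overhead of diffusion-style policies:
    # `steps` network evaluations per action.
    z = rng.normal(size=ACTION_DIM)
    dt = 1.0 / steps
    for k in range(steps):
        t = 1.0 - k * dt
        z = z - dt * mean_velocity(z, t, t, state)  # instantaneous v = u(z, t, t)
    return z

state = np.ones(ACTION_DIM)
action = meanflow_action(state)          # 1 call
baseline = diffusion_like_action(state)  # 50 calls
```

The 1-call vs 50-call gap is the source of the training and inference savings the paper reports; the actual speedups depend on the learned models and benchmarks.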

Abstract

Diffusion models have recently emerged as expressive policy representations for online reinforcement learning (RL). However, their iterative generative processes introduce substantial training and inference overhead. To overcome this limitation, we propose to represent policies using MeanFlow models, a class of few-step flow-based generative models, to improve training and inference efficiency over diffusion-based RL approaches. To promote exploration, we optimize MeanFlow policies under the maximum entropy RL framework via soft policy iteration, and address two key challenges specific to MeanFlow policies: action likelihood evaluation and soft policy improvement. Experiments on MuJoCo and DeepMind Control Suite benchmarks demonstrate that our method, Mean Flow Policy Optimization (MFPO), achieves performance comparable to or exceeding current diffusion-based baselines while considerably reducing training and inference time. Our code is available at https://github.com/MFPolicy/MFPO.
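For context, the maximum-entropy framework and soft policy iteration mentioned in the abstract refer to the standard entropy-regularized formulation (as in soft actor-critic); the notation below is a generic recap, not taken from the paper.

```latex
% Entropy-regularized objective (alpha is the temperature):
J(\pi) = \sum_t \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
         \big[ r(s_t, a_t) + \alpha \, \mathcal{H}(\pi(\cdot \mid s_t)) \big]

% Soft policy iteration alternates a soft Bellman backup ...
Q(s, a) \leftarrow r(s, a) + \gamma \, \mathbb{E}_{s'}
  \mathbb{E}_{a' \sim \pi}\big[ Q(s', a') - \alpha \log \pi(a' \mid s') \big]

% ... with an improvement step that requires action log-likelihoods:
\pi_{\text{new}} = \arg\max_{\pi} \; \mathbb{E}_{a \sim \pi}
  \big[ Q(s, a) - \alpha \log \pi(a \mid s) \big]
```

The \(\log \pi(a \mid s)\) terms are what make likelihood evaluation a key challenge here: few-step flow-based samplers like MeanFlow do not expose action log-probabilities as directly as Gaussian policies do, which is one of the two MeanFlow-specific issues the paper addresses.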