AI Navigate

Flow Matching Policy with Entropy Regularization

arXiv cs.LG / 3/19/2026


Key Points

  • FMER introduces an Ordinary Differential Equation-based online RL framework that parameterizes the policy via flow matching and samples actions along a straight probability path.
  • It derives a tractable entropy objective to enable principled maximum-entropy optimization for improved exploration.
  • The method leverages an advantage-weighted target velocity field derived from a candidate set to steer policy updates toward high-value regions, exploiting the model's generative nature.
  • Empirical results on sparse multi-goal FrankaKitchen benchmarks show that FMER outperforms state-of-the-art methods and remains competitive on standard MuJoCo benchmarks, while cutting training time (roughly 7x faster than heavy diffusion baselines such as QVPO and 10-15% faster than efficient variants).
  • The findings suggest meaningful gains in sample efficiency and computation for diffusion-based RL, with potential impact on robotics and other AI-controlled systems.
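To make the mechanics concrete, here is a minimal numpy sketch of the two ideas the key points describe: Euler integration of an ODE flow to sample an action from noise along a straight (rectified-flow) probability path, and an advantage-weighted target velocity built from a candidate action set. The linear `velocity_field`, its parameters `W` and `b`, and the softmax temperature are illustrative placeholders, not the paper's architecture or exact weighting scheme.

```python
import numpy as np

rng = np.random.default_rng(0)

def velocity_field(x, t, W, b):
    # Hypothetical linear stand-in for the learned velocity model v_theta(x, t).
    return x @ W + b * t

def sample_action(W, b, dim, steps=10):
    # Euler-integrate dx/dt = v_theta(x, t) over t in [0, 1],
    # starting from Gaussian noise, to produce an action sample.
    x = rng.standard_normal(dim)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * velocity_field(x, t, W, b)
    return x

def advantage_weighted_target(candidates, advantages, x0, temp=1.0):
    # Softmax-weight candidate actions by their advantages, then form the
    # straight-path target velocity v* = a_bar - x0 (rectified flow:
    # x_t = (1 - t) * x0 + t * a has constant velocity a - x0).
    w = np.exp((advantages - advantages.max()) / temp)
    w = w / w.sum()
    a_bar = (w[:, None] * candidates).sum(axis=0)
    return a_bar - x0
```

With equal advantages the target velocity simply points from the noise sample toward the mean candidate; higher-advantage candidates pull it toward higher-value regions, which is the steering effect the key points describe.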

Abstract

Diffusion-based policies have gained significant popularity in Reinforcement Learning (RL) due to their ability to represent complex, non-Gaussian distributions. Stochastic Differential Equation (SDE)-based diffusion policies often rely on indirect entropy control due to the intractability of the exact entropy, while also suffering from computationally prohibitive policy gradients through the iterative denoising chain. To overcome these issues, we propose Flow Matching Policy with Entropy Regularization (FMER), an Ordinary Differential Equation (ODE)-based online RL framework. FMER parameterizes the policy via flow matching and samples actions along a straight probability path, motivated by optimal transport. FMER leverages the model's generative nature to construct an advantage-weighted target velocity field from a candidate set, steering policy updates toward high-value regions. By deriving a tractable entropy objective, FMER enables principled maximum-entropy optimization for enhanced exploration. Experiments on sparse multi-goal FrankaKitchen benchmarks demonstrate that FMER outperforms state-of-the-art methods, while remaining competitive on standard MuJoCo benchmarks. Moreover, FMER reduces training time by 7x compared to heavy diffusion baselines (QVPO) and by 10-15% relative to efficient variants.
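The abstract's "tractable entropy objective" is stated without detail, but one generic reason ODE-based flows admit tractable density (and hence entropy) estimates is the instantaneous change-of-variables formula: the log-density along the flow changes by minus the divergence of the velocity field. The sketch below illustrates this for a hypothetical linear field, where the divergence is exactly trace(W); FMER's actual objective may be derived differently.

```python
import numpy as np

def log_prob_flow(x0, W, b, steps=100):
    # Push a standard-Gaussian sample x0 through the ODE flow
    # dx/dt = v(x, t) = x @ W + b * t, tracking log-density via
    # d(log p)/dt = -div v.  For this linear field, div v = trace(W).
    dim = x0.shape[0]
    log_p = -0.5 * (x0 @ x0) - 0.5 * dim * np.log(2 * np.pi)
    dt = 1.0 / steps
    x = x0.copy()
    for i in range(steps):
        t = i * dt
        v = x @ W + b * t
        log_p -= dt * np.trace(W)  # divergence of the linear field
        x = x + dt * v
    return x, log_p
```

Averaging `-log_p` over samples gives a Monte Carlo entropy estimate of the pushed-forward action distribution, which is what a maximum-entropy objective would regularize.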