Flow Matching Policy with Entropy Regularization
arXiv cs.LG / 3/19/2026
Key Points
- FMER introduces an ODE-based online RL framework that parameterizes the policy with flow matching and samples actions by integrating along a straight probability path.
- It derives a tractable entropy objective to enable principled maximum-entropy optimization for improved exploration.
- The method leverages an advantage-weighted target velocity field derived from a candidate set to steer policy updates toward high-value regions, exploiting the model's generative nature.
- Empirical results on sparse multi-goal FrankaKitchen benchmarks show FMER outperforms state-of-the-art methods and remains competitive on MuJoCo, while reducing training time (about 7x faster than heavy diffusion baselines like QVPO and 10-15% faster than efficient variants).
- The findings suggest meaningful gains in sample efficiency and computation for diffusion-based RL, with potential impact on robotics and other AI-controlled systems.
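The two core mechanics above, sampling an action by Euler-integrating a learned velocity field along a straight (rectified-flow-style) probability path, and forming an advantage-weighted target velocity from a candidate set, can be sketched as follows. This is a minimal illustration under assumptions, not the paper's implementation: the linear `velocity_field`, the softmax weighting with temperature `beta`, and all names are hypothetical stand-ins for FMER's actual networks and update rule.

```python
import numpy as np

rng = np.random.default_rng(0)

def velocity_field(x, t, W, b):
    # Hypothetical linear velocity model v_theta(x, t): a stand-in for the
    # paper's learned network mapping a noisy action x and time t to a velocity.
    inp = np.concatenate([x, [t]])
    return W @ inp + b

def sample_action(W, b, dim, steps=10):
    # Euler-integrate the ODE dx/dt = v_theta(x, t) from t=0 (Gaussian noise)
    # to t=1; a straight probability path keeps the step count small.
    x = rng.standard_normal(dim)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        x = x + dt * velocity_field(x, t, W, b)
    return x

def advantage_weighted_target(x0, candidates, advantages, beta=1.0):
    # Softmax-weight candidate actions by advantage (shifted for stability),
    # then form the straight-path target velocity a_bar - x0, which would
    # serve as the regression target steering updates toward high-value regions.
    w = np.exp(beta * (advantages - advantages.max()))
    w /= w.sum()
    a_bar = (w[:, None] * candidates).sum(axis=0)
    return a_bar - x0

dim = 2
W = rng.standard_normal((dim, dim + 1)) * 0.1
b = np.zeros(dim)
action = sample_action(W, b, dim)

x0 = rng.standard_normal(dim)          # noise endpoint of the straight path
cands = rng.standard_normal((4, dim))  # candidate actions from the policy
adv = np.array([0.1, 2.0, -1.0, 0.5])  # critic advantages for each candidate
target_v = advantage_weighted_target(x0, cands, adv)
```

With a higher `beta`, the softmax concentrates on the highest-advantage candidate, so the target velocity points more sharply toward it; `beta=0` recovers a uniform average over candidates.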