Flow-based Policy With Distributional Reinforcement Learning in Trajectory Optimization

arXiv cs.LG / 4/2/2026


Key Points

  • The paper introduces FP-DRL, a reinforcement learning algorithm for trajectory optimization that replaces the common diagonal-Gaussian policy parameterization with a flow-based policy learned via flow matching to better capture multimodal solutions.
  • It combines this flow-based policy representation with distributional RL to learn and optimize the full return distribution (not just an expected return), aiming to provide stronger guidance for policy updates in multi-solution settings.
  • The authors argue that traditional RL’s reliance on mean/expected returns can collapse multimodal structure and limit coverage of optimal behaviors, motivating the distributional treatment.
  • Experiments on MuJoCo benchmarks show FP-DRL reaching state-of-the-art performance on most control tasks and demonstrating improved representational capability compared with baseline flow policy approaches.
  • Overall, the contribution targets improved performance and richer policy representations for complex control/trajectory problems where multiple distinct optimal outcomes exist.
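The paper's exact flow-matching construction is not reproduced in this summary, but the core mechanism behind the first key point can be sketched under standard assumptions: straight-line (rectified-flow-style) probability paths from Gaussian noise to actions, whose conditional velocity field a policy network would be trained to regress. The `conditional_velocity` and `sample_action` names below are illustrative, not from the paper; with the closed-form per-sample field, Euler integration transports noise exactly onto each target action, so a bimodal action distribution is reproduced rather than averaged away as a diagonal Gaussian would do.

```python
import numpy as np

rng = np.random.default_rng(0)

def conditional_velocity(x, t, a):
    """Velocity of the straight-line path x_t = (1 - t) * x0 + t * a.

    For a single target action a, v(x, t) = (a - x) / (1 - t) transports
    any starting noise sample exactly onto a at t = 1. In flow matching,
    a network v_theta(x, t, state) is trained by regressing this target.
    """
    return (a - x) / (1.0 - t)

def sample_action(a, n_steps=50):
    """Draw Gaussian noise and integrate dx/dt = v(x, t) with Euler steps."""
    x = rng.normal()
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        x = x + dt * conditional_velocity(x, t, a)
    return x

# Hypothetical bimodal "optimal action" set: two equally good modes at -1 and +1.
modes = rng.choice([-1.0, 1.0], size=200)
actions = np.array([sample_action(a) for a in modes])

# The integrated flow lands on both modes instead of collapsing to their mean 0.
assert np.allclose(actions, modes)
```

For the straight-line path the Euler error telescopes to zero, which is why the final samples match the targets exactly here; with a learned network the match would only be approximate.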

Abstract

Reinforcement Learning (RL) has proven highly effective in complex control and decision-making tasks. However, most traditional RL algorithms parameterize the policy as a diagonal Gaussian distribution, which prevents it from capturing multimodal distributions and thus from covering the full range of optimal solutions in multi-solution problems; moreover, the return is reduced to a mean value, losing its multimodal structure and providing insufficient guidance for policy updates. To address these problems, we propose an RL algorithm termed flow-based policy with distributional RL (FP-DRL). FP-DRL models the policy with flow matching, which offers both computational efficiency and the capacity to fit complex distributions, and employs a distributional RL approach to model and optimize the entire return distribution, thereby more effectively guiding multimodal policy updates and improving agent performance. Experimental trials on MuJoCo benchmarks demonstrate that FP-DRL achieves state-of-the-art (SOTA) performance on most MuJoCo control tasks while exhibiting superior representational capability of the flow policy.
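The abstract's argument that a mean return "loses its multimodal nature" can be illustrated with a minimal quantile-based distributional sketch (the paper's specific distributional method is not given in this summary; quantile regression via the pinball loss is one standard choice, and the return values below are invented for illustration). The mean of a bimodal return distribution sits between the modes, at a value no trajectory actually achieves, while learned quantiles recover both modes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical bimodal return distribution: half the trajectories earn 0,
# the other half earn 10 (two distinct outcomes, as in multi-solution tasks).
returns = rng.choice([0.0, 10.0], size=1000)

def fit_quantile(z, tau, lr=0.05, steps=2000):
    """Estimate the tau-quantile by subgradient descent on the pinball loss.

    theta moves up when fewer than a tau-fraction of samples lie below it
    and down otherwise; the fixed point is the empirical tau-quantile.
    """
    theta = float(z.mean())
    for _ in range(steps):
        theta += lr * (tau - np.mean(z < theta))
    return theta

q_low = fit_quantile(returns, 0.25)   # settles near the low mode, 0
q_high = fit_quantile(returns, 0.75)  # settles near the high mode, 10

# The mean collapses both modes to roughly 5, masking the bimodal structure
# that the two quantile estimates still expose.
print(returns.mean(), q_low, q_high)
```

A distributional critic maintains many such quantile (or categorical) statistics per state-action pair instead of a single expected value, which is the extra signal the paper argues is needed to guide a multimodal policy.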