A Comparison of Reinforcement Learning and Optimal Control Methods for Path Planning

arXiv cs.RO / 4/15/2026


Key Points

  • Formulates autonomous navigation as a path-planning problem of reaching a destination while avoiding a threat region (a circular "no-go" zone), and identifies the computational time of traditional optimal control, too slow for real-time use, as the core challenge.
  • Proposes a learning-based controller using DDPG (Deep Deterministic Policy Gradient) that maps the state (position and velocity) directly to a sequence of feasible actions, using two neural networks (critic and actor) and a reward design aimed at safe arrival.
  • Shows that DDPG learns a "feasible set" of starting points from which a safe path to the goal is guaranteed, so that task achievability can be estimated in advance; this is presented as useful information for mission planning.
  • In a comparison with a pseudo-spectral method (traditional optimal control), DDPG generates valid paths significantly faster, but an unreachable "infeasible set" remains, and even within the feasible set the paths are not necessarily optimal.
  • Lists future directions: enlarging the feasible set by improving the reward function, verifying the feasible set obtained with the pseudo-spectral method, and extending the work to the arc-search IPM.
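The failure/success conditions above can be sketched as a minimal reward function. This is a hedged illustration, not the paper's actual reward design: the goal location, goal-neighborhood radius, threat geometry, and all reward constants here are assumptions chosen for the example.

```python
import math

# Illustrative constants -- NOT from the paper.
GOAL = (10.0, 10.0)        # destination (assumed)
GOAL_RADIUS = 0.5          # "neighborhood of the destination" (assumed)
THREAT_CENTER = (5.0, 5.0) # center of the circular no-go zone (assumed)
THREAT_RADIUS = 2.0        # radius of the no-go zone (assumed)

def reward(x, y):
    """Return (reward, done) for a vehicle at position (x, y).

    Entering the no-go zone is a mission failure; reaching a
    neighborhood of the destination is a success; otherwise a small
    shaping term rewards progress toward the goal.
    """
    d_threat = math.hypot(x - THREAT_CENTER[0], y - THREAT_CENTER[1])
    if d_threat <= THREAT_RADIUS:
        return -100.0, True            # entered the no-go zone: failure
    d_goal = math.hypot(x - GOAL[0], y - GOAL[1])
    if d_goal <= GOAL_RADIUS:
        return 100.0, True             # reached the goal neighborhood
    return -0.1 * d_goal, False        # shaping: closer to the goal is better
```

In a DDPG setup, a per-step signal like this is what the critic network learns to predict (as a Q-value) and the actor is trained to maximize.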

Abstract

Path-planning for autonomous vehicles in threat-laden environments is a fundamental challenge. While traditional optimal control methods can find ideal paths, their computational time is often too slow for real-time decision-making. To address this challenge, we propose a method based on Deep Deterministic Policy Gradient (DDPG) and model the threat as a simple, circular "no-go" zone. A mission failure is declared if the vehicle enters this "no-go" zone at any time or does not reach a neighborhood of the destination. The DDPG agent is trained to learn a direct mapping from its current state (position and velocity) to a series of feasible actions that guide it to safely reach its goal. A reward function and two neural networks, a critic and an actor, describe the environment and guide the control efforts. DDPG trains the agent to find the largest possible set of starting points (the "feasible set") from which a safe path to the goal is guaranteed. This provides critical information for mission planning, showing beforehand whether a task is achievable from a given starting point and assisting pre-mission planning activities. The approach is validated in simulation, and a comparison between the DDPG method and a traditional optimal control (pseudo-spectral) method is carried out. The results show that the learning-based agent can produce effective paths while being significantly faster, making it a better fit for real-time applications. However, there are regions (the "infeasible set") where the DDPG agent cannot find paths to the destination, and the paths within the feasible set may not be optimal. These preliminary results guide our future research: (1) improve the reward function to enlarge the DDPG feasible set, (2) examine the feasible set obtained by the pseudo-spectral method, and (3) investigate the arc-search IPM method for the path-planning problem.
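The feasible-set idea in the abstract can be illustrated with a simple rollout check: simulate the policy from each candidate start point on a grid and record whether it reaches the goal without entering the threat zone. This is a hedged sketch under assumed geometry; the policy here is a placeholder (constant-speed motion straight toward the goal), not a trained DDPG actor, so its infeasible set is just the region whose straight-line path crosses the threat.

```python
import math

# Illustrative constants -- NOT from the paper.
GOAL = (10.0, 10.0)
GOAL_RADIUS = 0.5
THREAT_CENTER = (5.0, 5.0)
THREAT_RADIUS = 2.0
STEP = 0.2        # distance moved per simulation step (assumed)
MAX_STEPS = 200   # rollout horizon (assumed)

def rollout_safe(x, y):
    """Simulate the placeholder policy from (x, y).

    Returns True if the goal neighborhood is reached without ever
    entering the circular no-go zone, False otherwise.
    """
    for _ in range(MAX_STEPS):
        if math.hypot(x - THREAT_CENTER[0], y - THREAT_CENTER[1]) <= THREAT_RADIUS:
            return False    # entered the no-go zone
        dx, dy = GOAL[0] - x, GOAL[1] - y
        d_goal = math.hypot(dx, dy)
        if d_goal <= GOAL_RADIUS:
            return True     # reached the goal neighborhood
        # Placeholder policy: move straight toward the goal.
        x += STEP * dx / d_goal
        y += STEP * dy / d_goal
    return False            # horizon exhausted

# Estimate the feasible set on a coarse grid of start points.
feasible = [(x, y) for x in range(0, 11, 2) for y in range(0, 11, 2)
            if rollout_safe(float(x), float(y))]
```

With a trained DDPG actor in place of the straight-line rule, the same grid sweep yields the map the paper describes: which starting points admit a safe path and which fall in the infeasible set.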