Sample-efficient Neuro-symbolic Proximal Policy Optimization
arXiv cs.AI / April 29, 2026
Key Points
- The paper proposes a neuro-symbolic version of Proximal Policy Optimization (PPO) aimed at reducing data needs in deep reinforcement learning for sparse-reward, long-horizon, multi-subgoal tasks.
- It transfers partially learned logical policy specifications from easier environments to harder ones, using two symbolic-guidance mechanisms to steer learning.
- The first method, H-PPO-Product, biases the action distribution at sampling time, while the second, H-PPO-SymLoss, adds a symbolic regularization term to the PPO objective (a sketch of both mechanisms follows this list).
- Experiments on OfficeWorld, WaterWorld, and DoorKey show faster learning and higher final returns than standard PPO and a Reward Machine baseline, even when the symbolic knowledge is imperfect.
- Overall, the results suggest that incorporating symbolic policy structure can substantially improve the sample efficiency and robustness of reinforcement learning in challenging planning problems.
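
To make the two guidance mechanisms concrete, here is a minimal PyTorch sketch of one plausible reading of each. The paper's exact formulations are not given in this summary, so the details are assumptions: `symbolic_prior` stands for a hypothetical per-state action distribution derived from the logical policy specification, the product-and-renormalize step is inferred from the name H-PPO-Product, and the cross-entropy regularizer is just one reasonable form a symbolic loss term could take.

```python
import torch
import torch.nn.functional as F

def product_policy_sample(logits, symbolic_prior, eps=1e-8):
    """H-PPO-Product (sketch): sample from the renormalized product of the
    neural policy and a symbolic prior.

    logits:         (batch, n_actions) raw policy-network outputs
    symbolic_prior: (batch, n_actions) action probabilities suggested by the
                    logical specification (hypothetical encoding; e.g. uniform
                    wherever the symbolic knowledge is silent)
    """
    neural_probs = F.softmax(logits, dim=-1)
    # Elementwise product + renormalization biases sampling toward
    # actions the symbolic policy endorses, without hard-masking others.
    mixed = neural_probs * (symbolic_prior + eps)
    mixed = mixed / mixed.sum(dim=-1, keepdim=True)
    dist = torch.distributions.Categorical(probs=mixed)
    return dist.sample(), dist

def ppo_symloss(logits, actions, advantages, old_log_probs, symbolic_prior,
                clip_eps=0.2, sym_coef=0.1):
    """H-PPO-SymLoss (sketch): clipped PPO surrogate plus a symbolic
    regularizer, here a cross-entropy pulling the current policy toward
    the symbolic prior (an assumed form, not the paper's exact loss)."""
    dist = torch.distributions.Categorical(logits=logits)
    log_probs = dist.log_prob(actions)
    # Standard PPO clipped objective.
    ratio = torch.exp(log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    ppo_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    # Symbolic regularization: penalize divergence from the prior.
    sym_loss = -(symbolic_prior * F.log_softmax(logits, dim=-1)).sum(-1).mean()
    return ppo_loss + sym_coef * sym_loss
```

Note the design difference the two variants imply: H-PPO-Product shapes behavior only at sampling time and leaves the PPO objective untouched, whereas H-PPO-SymLoss changes the gradient signal itself, so its influence persists in the learned weights even when the symbolic prior is later removed.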


