RAMP: Hybrid DRL for Online Learning of Numeric Action Models

arXiv cs.AI / 4/13/2026


Key Points

  • The paper introduces RAMP, a strategy that learns numeric planning action models online through environment interactions instead of relying on offline training with expert traces.
  • RAMP jointly trains a deep reinforcement learning (DRL) policy and a numeric action model from past experience, using the learned model to plan and choose future actions.
  • The approach is designed as a positive feedback loop where the planner’s action proposals help improve the RL policy while the RL policy’s exploration generates data to refine the action model.
  • To bridge numeric planning problems and reinforcement learning, the authors develop Numeric PDDLGym, an automated converter from numeric planning tasks to Gym-compatible environments.
  • Experiments on standard IPC numeric planning domains report that RAMP significantly outperforms PPO in both solvability and plan quality.
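The feedback loop described in the key points can be sketched in miniature. This is a hypothetical illustration, not the paper's actual algorithm or API: the toy environment, the one-step "planner," and all function names are assumptions. The essential structure is that planning with the learned model guides action selection when possible, exploration falls back to a policy (here, a random one standing in for DRL), and every transition refines the action model.

```python
import random

def ramp_loop(env, episodes=20):
    """Illustrative RAMP-style loop: plan with a learned action model
    when it supports a plan, otherwise explore; refine the model from
    every observed transition (all names here are hypothetical)."""
    model = {}        # learned effects: action -> set of observed state deltas
    solved = 0
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            plan = try_plan(model, state, env.goal)      # use learned model
            action = plan[0] if plan else random_policy(env)  # else explore
            next_state, done = env.step(action)
            model.setdefault(action, set()).add(next_state - state)  # refine
            state = next_state
        solved += int(state >= env.goal)
    return model, solved

def random_policy(env):
    # Stand-in for the DRL policy: uniform random exploration.
    return random.choice(env.actions)

def try_plan(model, state, goal):
    # Greedy one-step "planner": if the goal is above the current value,
    # pick any action whose learned effect increases the numeric state.
    if state < goal:
        for action, deltas in model.items():
            if any(d > 0 for d in deltas):
                return [action]
    return None

class ToyNumericEnv:
    """One-variable numeric domain: raise x from 0 to the goal."""
    actions = ["inc", "dec"]
    goal = 3
    def reset(self):
        self.x = 0
        return self.x
    def step(self, action):
        self.x += 1 if action == "inc" else -1
        return self.x, self.x >= self.goal or self.x <= -5
```

Once the model has observed that `inc` has a positive effect, the planner proposes it deterministically and every subsequent episode reaches the goal, which is the positive feedback loop in miniature.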

Abstract

Automated planning algorithms require an action model specifying the preconditions and effects of each action, but obtaining such a model is often hard. Learning action models from observations is feasible, but existing algorithms for numeric domains are offline, requiring expert traces as input. We propose the Reinforcement learning, Action Model learning, and Planning (RAMP) strategy for learning numeric planning action models online via interactions with the environment. RAMP simultaneously trains a Deep Reinforcement Learning (DRL) policy, learns a numeric action model from past interactions, and uses that model to plan future actions when possible. These components form a positive feedback loop: the RL policy gathers data to refine the action model, while the planner generates plans to continue training the RL policy. To facilitate this integration of RL and numeric planning, we developed Numeric PDDLGym, an automated framework for converting numeric planning problems to Gym environments. Experimental results on standard IPC numeric domains show that RAMP significantly outperforms PPO, a well-known DRL algorithm, in terms of solvability and plan quality.
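Since the summary does not show Numeric PDDLGym's actual interface, the conversion it performs can only be sketched by assumption. Below is a minimal, self-contained illustration of what exposing a numeric planning task behind the standard Gym interface (`reset()` returning an observation, `step()` returning observation, reward, done flag, and info) might look like; the task, action names, and reward shaping are all invented for the example.

```python
class NumericPlanningEnv:
    """Hypothetical Gym-style wrapper around a toy numeric planning task:
    raise the fluent `fuel` to `goal`. Not the real Numeric PDDLGym API."""

    def __init__(self, goal=5.0):
        self.goal = goal
        self.action_space = ["refuel", "burn"]   # grounded actions
        self.fuel = 0.0

    def reset(self):
        # Return the initial numeric state as the observation.
        self.fuel = 0.0
        return self._obs()

    def step(self, action):
        # Apply the action's numeric effect, as a PDDL action model would.
        self.fuel += 2.0 if action == "refuel" else -1.0
        done = self.fuel >= self.goal            # goal condition reached
        reward = 1.0 if done else -0.1           # sparse goal reward, step cost
        return self._obs(), reward, done, {}

    def _obs(self):
        return (self.fuel,)
```

A wrapper of this shape is what lets an off-the-shelf DRL algorithm such as PPO train directly on a planning problem, which is the integration the paper's experiments rely on.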