Enhancing Policy Learning with World-Action Model

arXiv cs.AI / 4/1/2026


Key Points

  • The paper introduces the World-Action Model (WAM), an action-regularized world model that predicts future visual observations while jointly learning action-driven state transitions via an inverse dynamics objective added to DreamerV2.
  • By encouraging latent representations to capture action-relevant structure, WAM aims to improve downstream control performance compared with image-prediction-only world models.
  • Experiments on eight CALVIN manipulation tasks show that WAM raises average behavioral cloning success from 59.4% (DreamerV2/DiWA baselines) to 71.2%, using an identical policy architecture and training procedure.
  • After PPO fine-tuning inside a frozen world model, WAM reaches 92.8% average success versus 79.8% for the baseline, including two tasks at 100% success.
  • The approach achieves the reported PPO gains with 8.7x fewer training steps, suggesting improved sample efficiency for model-based policy learning.
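The core idea in the first two bullets can be sketched as a loss decomposition: alongside the usual image-prediction objective, an inverse-dynamics head predicts the action from a pair of consecutive latent states, and its error is added to the world-model loss. The sketch below is a minimal toy version with a linear head and a hand-picked weight `lam`; the actual network architecture and weighting in the paper are not specified here, so treat every name and shape as an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def inverse_dynamics_loss(z_t, z_next, actions, W, b):
    """Predict the action that caused the transition z_t -> z_next and
    return the MSE against the true actions. (Hypothetical linear head;
    the paper uses a learned network inside DreamerV2's latent space.)"""
    x = np.concatenate([z_t, z_next], axis=-1)  # (batch, 2 * latent_dim)
    pred = x @ W + b                            # (batch, action_dim)
    return float(np.mean((pred - actions) ** 2))

def action_regularized_loss(recon_loss, inv_dyn_loss, lam=1.0):
    """Image-prediction loss plus a weighted inverse-dynamics term.
    `lam` is an assumed trade-off weight, not a value from the paper."""
    return recon_loss + lam * inv_dyn_loss

# Toy shapes: latent_dim=4, action_dim=2, batch=8.
z_t = rng.normal(size=(8, 4))
z_next = rng.normal(size=(8, 4))
actions = rng.normal(size=(8, 2))
W = rng.normal(size=(8, 2)) * 0.1
b = np.zeros(2)

inv = inverse_dynamics_loss(z_t, z_next, actions, W, b)
total = action_regularized_loss(recon_loss=0.5, inv_dyn_loss=inv)
print(total)  # strictly larger than the image loss alone when inv > 0
```

Because the inverse-dynamics term is differentiable with respect to the latents, minimizing it pushes the representation to retain whatever information distinguishes one action's transition from another's, which is the "action-relevant structure" the summary refers to.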

Abstract

This paper presents the World-Action Model (WAM), an action-regularized world model that jointly reasons over future visual observations and the actions that drive state transitions. Unlike conventional world models trained solely via image prediction, WAM incorporates an inverse dynamics objective into DreamerV2 that predicts actions from latent state transitions, encouraging the learned representations to capture action-relevant structure critical for downstream control. We evaluate WAM on enhancing policy learning across eight manipulation tasks from the CALVIN benchmark. We first pretrain a diffusion policy via behavioral cloning on world model latents, then refine it with model-based PPO inside the frozen world model. Without modifying the policy architecture or training procedure, WAM improves average behavioral cloning success from 59.4% to 71.2% over DreamerV2 and DiWA baselines. After PPO fine-tuning, WAM achieves 92.8% average success versus 79.8% for the baseline, with two tasks reaching 100%, using 8.7x fewer training steps.
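The two-phase policy pipeline the abstract describes (behavioral cloning on world-model latents, then PPO refinement inside the frozen model) can be illustrated with a toy sketch. A linear policy stands in for the diffusion policy, and only the PPO clipped surrogate is shown for the second phase; all function names, the learning rate, and the clip range `eps=0.2` are assumptions for illustration, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def behavioral_cloning_step(policy_w, latents, expert_actions, lr=1e-2):
    """One BC gradient step on world-model latents: regress expert
    actions with a linear policy (stand-in for the diffusion policy)."""
    pred = latents @ policy_w
    grad = latents.T @ (pred - expert_actions) / len(latents)
    return policy_w - lr * grad

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO clipped surrogate, the update used when fine-tuning the
    policy on rollouts imagined inside the frozen world model."""
    return np.minimum(ratio * advantage,
                      np.clip(ratio, 1 - eps, 1 + eps) * advantage)

# Phase 1: BC pretraining on latents (toy data).
latents = rng.normal(size=(32, 4))
expert = latents @ rng.normal(size=(4, 2))  # toy expert targets
w = np.zeros((4, 2))
for _ in range(200):
    w = behavioral_cloning_step(w, latents, expert)

# Phase 2: PPO refinement signal (toy numbers).
obj = ppo_clip_objective(ratio=np.array([1.5]), advantage=np.array([1.0]))
print(float(obj[0]))  # clipped at 1 + eps = 1.2
```

Keeping the world model frozen during PPO means the policy improves against a fixed simulator of the environment, which is what makes the reported 8.7x reduction in training steps a claim about policy-learning sample efficiency rather than about world-model training.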