Lifting Embodied World Models for Planning and Control

arXiv cs.AI / 4/30/2026


Key Points

  • Embodied world models that predict future observations from actions become hard to plan with when the agent’s action space is high-dimensional (e.g., full joint control for humans).
  • The paper proposes training a lightweight high-level policy that converts a compact action representation into sequences of low-level joint commands, then composing it with a frozen world model to obtain a “lifted” world model.
  • The lifted model can predict a sequence of future observations from a single high-level action, using a low-dimensional action interface based on 2D waypoints tied to near-term goals for key joints.
  • Experiments on a human-like embodiment show substantially improved planning accuracy versus directly searching in low-level joint space (3.8× lower mean joint error), while also being more compute-efficient and generalizing to unseen environments.
  • The approach emphasizes better controllability and planning tractability by making actions more interpretable and easier to specify or search over.
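The composition at the heart of the method can be sketched in a few lines. This is a minimal illustration under assumptions: `WorldModel`, `HighLevelPolicy`, and all shapes and interfaces here are hypothetical stand-ins, not the paper's actual architecture.

```python
import numpy as np

class WorldModel:
    """Frozen low-level world model: predicts the next observation
    from the current observation and one joint-space action (stub)."""
    def step(self, obs, joint_action):
        # Placeholder dynamics; the real model is a learned network.
        return obs + 0.1 * float(joint_action.mean())

class HighLevelPolicy:
    """Lightweight policy mapping one high-level action (2D waypoints)
    to a sequence of low-level joint commands (stub)."""
    def __init__(self, horizon=8, joint_dim=24):
        self.horizon, self.joint_dim = horizon, joint_dim
    def rollout(self, obs, waypoints):
        # Placeholder: the real policy conditions on obs and waypoints.
        rng = np.random.default_rng(0)
        return rng.normal(size=(self.horizon, self.joint_dim))

def lifted_step(world_model, policy, obs, waypoints):
    """Lifted world model: one high-level action in, a sequence of
    predicted future observations out."""
    observations = []
    for joint_action in policy.rollout(obs, waypoints):
        obs = world_model.step(obs, joint_action)
        observations.append(obs)
    return observations

# One high-level action: 2D waypoints for 4 leaf joints
# (pelvis, head, two hands).
waypoints = np.zeros((4, 2))
preds = lifted_step(WorldModel(), HighLevelPolicy(), obs=0.0, waypoints=waypoints)
```

Note that the world model stays frozen: only the small policy is trained, so lifting adds little cost while shrinking the interface the planner has to search over.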

Abstract

World models of embodied agents predict future observations conditioned on an action taken by the agent. For complex embodiments, action spaces are high-dimensional and difficult to specify: for example, precisely controlling a human agent requires specifying the motion of each joint. This makes the world model hard to control and expensive to plan with, as search-based methods such as the cross-entropy method (CEM) scale poorly with action dimensionality. To address this issue, we train a lightweight policy that maps high-level actions to sequences of low-level joint actions. Composing this policy with the frozen world model produces a lifted world model that predicts a sequence of future observations from a single high-level action. We instantiate this framework for a human-like embodiment, defining the high-level action space as a small set of 2D waypoints annotated on the current observation frame, each specifying a near-term goal position for a leaf joint (pelvis, head, hands). Waypoints are low-dimensional, visually interpretable, and easy to specify manually or to search over. We show that the lifted world model substantially outperforms searching directly in low-level joint space (3.8× lower mean joint error to the goal pose), while remaining more compute-efficient and generalizing to environments unseen by the policy.
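To make the tractability argument concrete, here is a generic cross-entropy method (CEM) loop of the kind the abstract alludes to. It is a sketch, not the paper's planner: the cost function, population sizes, and the 8-dimensional waypoint action (4 leaf joints × 2D) are illustrative assumptions. The point is that refitting a Gaussian over an 8-D waypoint vector is far cheaper than searching a joint-space trajectory of, say, 24 dimensions per timestep.

```python
import numpy as np

def cem_plan(cost_fn, action_dim, iters=10, pop=64, elite=8, seed=0):
    """Cross-entropy method: repeatedly sample actions from a Gaussian,
    keep the lowest-cost (elite) samples, and refit the Gaussian to them."""
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(action_dim), np.ones(action_dim)
    for _ in range(iters):
        samples = rng.normal(mu, sigma, size=(pop, action_dim))
        costs = np.array([cost_fn(a) for a in samples])
        elites = samples[np.argsort(costs)[:elite]]
        # Refit; small floor on sigma keeps the search from collapsing.
        mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mu

# Toy cost: squared distance of a flattened 4-waypoint (8-D) action to a
# goal. In the lifted setting, cost would come from rolling out the
# lifted world model and scoring the predicted observations.
goal = np.linspace(0.0, 1.0, 8)
cost = lambda a: float(np.sum((a - goal) ** 2))
best = cem_plan(cost, action_dim=8)
```

Because CEM's sample complexity grows with action dimensionality, searching over waypoints rather than per-joint commands is where the compute savings reported in the paper plausibly come from.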