Lifting Embodied World Models for Planning and Control

arXiv cs.AI / 4/30/2026



Key Points

Embodied world models that predict future observations from actions become hard to plan with when the agent’s action space is high-dimensional (e.g., full joint control for humans).
The paper proposes training a lightweight high-level policy that converts a compact action representation into sequences of low-level joint commands, then composing it with a frozen world model to obtain a “lifted” world model.
The lifted model can predict a sequence of future observations from a single high-level action, using a low-dimensional action interface based on 2D waypoints tied to near-term goals for key joints.
Experiments on a human-like embodiment show substantially improved planning accuracy versus directly searching in low-level joint space (3.8× lower mean joint error), while also being more compute-efficient and generalizing to unseen environments.
The approach emphasizes better controllability and planning tractability by making actions more interpretable and easier to specify or search over.
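The composition described in the points above can be sketched in a few lines. This is only a toy illustration, not the paper's implementation: `toy_world_model`, `toy_policy`, and `lift` are hypothetical names, observations are reduced to a flat list of joint coordinates, and each waypoint is a single scalar target rather than a 2D image-space point.

```python
from typing import Callable, Dict, List

Obs = List[float]             # toy observation: one coordinate per joint
LowAction = List[float]       # toy low-level action: one delta per joint
Waypoints = Dict[int, float]  # key-joint index -> target coordinate (assumed format)


def toy_world_model(obs: Obs, action: LowAction) -> Obs:
    """Stand-in for the frozen world model: one low-level action -> next observation."""
    return [o + a for o, a in zip(obs, action)]


def toy_policy(obs: Obs, waypoints: Waypoints, horizon: int = 4) -> List[LowAction]:
    """Stand-in for the learned policy: expand waypoints into `horizon` joint commands."""
    step = [0.0] * len(obs)
    for joint, target in waypoints.items():
        step[joint] = (target - obs[joint]) / horizon
    return [list(step) for _ in range(horizon)]


def lift(
    world_model: Callable[[Obs, LowAction], Obs],
    policy: Callable[[Obs, Waypoints], List[LowAction]],
) -> Callable[[Obs, Waypoints], List[Obs]]:
    """Compose policy and world model: one high-level action -> sequence of observations."""

    def lifted(obs: Obs, waypoints: Waypoints) -> List[Obs]:
        frames = []
        for action in policy(obs, waypoints):
            obs = world_model(obs, action)  # the world model itself is never modified
            frames.append(obs)
        return frames

    return lifted


lifted_model = lift(toy_world_model, toy_policy)
# Six joints, but the caller only specifies targets for two key joints.
frames = lifted_model([0.0] * 6, {0: 1.0, 5: 2.0})
```

In this sketch the caller specifies two numbers instead of a six-dimensional command at every step, which is the kind of reduction the lifted interface is meant to provide.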

Abstract

World models of embodied agents predict future observations conditioned on an action taken by the agent. For complex embodiments, action spaces are high-dimensional and difficult to specify: for example, precisely controlling a human agent requires specifying the motion of each joint. This makes the world model hard to control and expensive to plan with as search-based methods like CEM scale poorly with action dimensionality. To address this issue, we train a lightweight policy that maps high-level actions to sequences of low-level joint actions. Composing this policy with the frozen world model produces a lifted world model that predicts a sequence of future observations from a single high-level action. We instantiate this framework for a human-like embodiment, defining the high-level action space as a small set of 2D waypoints annotated on the current observation frame, each specifying a near-term goal position for a leaf joint (pelvis, head, hands). Waypoints are low-dimensional, visually interpretable, and easy to specify manually or to search over. We show that the lifted world model substantially outperforms searching directly in low-level joint space (3.8× lower mean joint error to the goal pose), while remaining more compute-efficient and generalizing to environments unseen by the policy.
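To make the planning step concrete, here is a minimal sketch of a cross-entropy method (CEM) search over a single 2D waypoint. Everything in it is an assumption for illustration: `lifted_hand_position` is a hypothetical stand-in for a lifted world model (it just moves a hand 80% of the way to the waypoint), and the population size, elite count, and iteration count are arbitrary choices rather than values from the paper.

```python
import math
import random
import statistics
from typing import Callable, List


def lifted_hand_position(start: List[float], waypoint: List[float]) -> List[float]:
    """Toy lifted model: predicted final hand position for one 2D waypoint.

    The hand is assumed to cover only 80% of the distance, so the best
    waypoint is not simply the goal itself.
    """
    return [s + 0.8 * (w - s) for s, w in zip(start, waypoint)]


def cem(
    cost: Callable[[List[float]], float],
    dim: int,
    iters: int = 20,
    pop: int = 64,
    n_elite: int = 8,
    seed: int = 0,
) -> List[float]:
    """Basic CEM: sample from a diagonal Gaussian, refit it to the lowest-cost samples."""
    rng = random.Random(seed)
    mean = [0.0] * dim
    std = [1.0] * dim
    for _ in range(iters):
        samples = [[rng.gauss(m, s) for m, s in zip(mean, std)] for _ in range(pop)]
        elites = sorted(samples, key=cost)[:n_elite]
        columns = list(zip(*elites))
        mean = [statistics.fmean(c) for c in columns]
        # Small floor keeps the sampling distribution from collapsing to zero width.
        std = [statistics.pstdev(c) + 1e-6 for c in columns]
    return mean


start = [0.0, 0.0]
goal = [1.0, -0.5]


def cost(waypoint: List[float]) -> float:
    """Distance between the predicted final hand position and the goal."""
    return math.dist(lifted_hand_position(start, waypoint), goal)


best_waypoint = cem(cost, dim=2)
final_error = cost(best_waypoint)
```

The search space here has two dimensions. Searching directly over low-level commands would instead require one dimension per joint per time step, which is where the abstract's point about CEM scaling poorly with action dimensionality applies.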

