OOWM: Structuring Embodied Reasoning and Planning via Object-Oriented Programmatic World Modeling

arXiv cs.AI / 4/14/2026

Key Points

The paper argues that standard Chain-of-Thought prompting is limited for embodied robotics because linear text does not explicitly model the state space, object hierarchies, or causal dependencies needed for planning.
It proposes Object-Oriented World Modeling (OOWM), defining a world model as an explicit symbolic tuple W=⟨S,T⟩ with separate state abstraction and transition/control logic, grounded using UML class and activity diagrams.
OOWM introduces a three-stage training pipeline that combines supervised fine-tuning with Group Relative Policy Optimization (GRPO), using outcome-based rewards to implicitly optimize the object-oriented reasoning structure even with sparse annotations.
Experiments on the MRoom-30k benchmark report improvements over unstructured textual baselines in planning coherence, execution success, and structural fidelity, suggesting a more structured paradigm for embodied reasoning.
The work reframes world modeling from latent vector representations toward software-engineering-like formalisms to better connect perception, reasoning, and executable planning.
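
To make the tuple W=⟨S,T⟩ concrete, here is a minimal Python sketch of an explicit, object-oriented world model. It is an illustration only, not the paper's implementation: the class names (`Obj`, `Container`, `State`, `Action`), the two verbs, and the toy kitchen scene are all hypothetical. The dataclasses play the role of a class diagram (state abstraction), and `transition` plays the role of the transition logic T: S × A → S'.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class Obj:
    """Root of a hypothetical object hierarchy (the 'class diagram')."""
    name: str
    location: str


@dataclass(frozen=True)
class Container(Obj):
    """A subclass with extra state, e.g. a drawer that can be opened."""
    is_open: bool = False


@dataclass(frozen=True)
class State:
    """S: the explicit environment state, a set of typed objects."""
    objects: Tuple[Obj, ...]
    holding: Optional[str] = None

    def get(self, name: str) -> Optional[Obj]:
        return next((o for o in self.objects if o.name == name), None)


@dataclass(frozen=True)
class Action:
    verb: str    # hypothetical verbs: "open" or "pick"
    target: str


def transition(s: State, a: Action) -> State:
    """T: S x A -> S'. Returns a new state; invalid actions return s unchanged."""
    obj = s.get(a.target)
    if obj is None:
        return s
    if a.verb == "open" and isinstance(obj, Container):
        opened = replace(obj, is_open=True)
        objects = tuple(opened if o.name == a.target else o for o in s.objects)
        return replace(s, objects=objects)
    if a.verb == "pick" and s.holding is None and not isinstance(obj, Container):
        return replace(s, holding=a.target)
    return s


# Usage: a two-step plan applied to a toy scene.
s0 = State(objects=(Container("drawer", "kitchen"), Obj("cup", "kitchen")))
s1 = transition(s0, Action("open", "drawer"))
s2 = transition(s1, Action("pick", "cup"))
```

Because states are immutable, each step yields a new `State`, so a plan can be checked step by step against explicit preconditions rather than inferred from free-form text.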

Abstract

Standard Chain-of-Thought (CoT) prompting empowers Large Language Models (LLMs) with reasoning capabilities, yet its reliance on linear natural language is inherently insufficient for effective world modeling in embodied tasks. While text offers flexibility, it fails to explicitly represent the state-space, object hierarchies, and causal dependencies required for robust robotic planning. To address these limitations, we propose Object-Oriented World Modeling (OOWM), a novel framework that structures embodied reasoning through the lens of software engineering formalisms. We redefine the world model not as a latent vector space, but as an explicit symbolic tuple $W = \langle S, T \rangle$: a State Abstraction ($G_\text{state}$) instantiating the environmental state $S$, coupled with a Control Policy ($G_\text{control}$) representing the transition logic $T: S \times A \rightarrow S'$. OOWM leverages the Unified Modeling Language (UML) to materialize this definition: it employs Class Diagrams to ground visual perception into rigorous object hierarchies, and Activity Diagrams to operationalize planning into executable control flows. Furthermore, we introduce a three-stage training pipeline combining Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO). Crucially, this method utilizes outcome-based rewards from the final plan to implicitly optimize the underlying object-oriented reasoning structure, enabling effective learning even with sparse annotations. Extensive evaluations on the MRoom-30k benchmark demonstrate that OOWM significantly outperforms unstructured textual baselines in planning coherence, execution success, and structural fidelity, establishing a new paradigm for structured embodied reasoning.
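
The abstract does not spell out the reward design or the three training stages, so the following is only a sketch of the group-relative normalization step commonly associated with GRPO: several outputs are sampled for the same prompt, each receives an outcome reward, and each output's advantage is its reward standardized against the group. The binary "plan succeeded" rewards, the function name `group_relative_advantages`, and the `eps` stabilizer are assumptions for illustration, not details taken from the paper.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize each reward against its sampling group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Usage: four sampled plans for one prompt, with a hypothetical binary
# outcome reward (1.0 if the final plan succeeded, 0.0 otherwise).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Successful plans receive a positive advantage and failed ones a negative advantage, so the intermediate reasoning that led to a good final plan is reinforced without needing per-step labels. If every sample in a group gets the same reward, all advantages are zero and that group contributes no learning signal.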