HCLSM: Hierarchical Causal Latent State Machines for Object-Centric World Modeling

arXiv cs.RO / 4/1/2026


Key Points

  • The paper introduces HCLSM, an object-centric world modeling architecture that addresses limitations of “flat” latent states by decomposing scenes into slots, modeling temporal dynamics hierarchically, and learning causal structure via interaction graphs.
  • Object-centric decomposition uses slot attention with spatial broadcast decoding, while temporal dynamics run through a three-level engine (SSMs for continuous physics, sparse transformers for discrete events, and compressed transformers for abstract goals) rather than collapsing time into a single scale.
  • It uses graph neural network interaction patterns to infer causal structure, producing learned event boundaries during training and improved next-state prediction.
  • Experiments on the PushT robotic manipulation benchmark (from the Open X-Embodiment dataset) report a next-state prediction MSE of 0.008, with effective spatial decomposition (spatial broadcast decoder loss: 0.0075).
  • The work includes substantial systems engineering, such as a custom Triton kernel for the SSM scan that reportedly achieves a 38× speedup, and provides code with a suite of 171 unit tests.
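
The paper's implementation is not quoted here, but the slot-attention mechanism behind the object-centric decomposition can be illustrated with a minimal NumPy sketch. All names and the random projections below are hypothetical stand-ins, not the paper's code; the key idea is the softmax over the slot axis, which makes slots compete for input features (the actual method also uses a GRU update and layer norms, omitted for brevity):

```python
import numpy as np

def slot_attention_step(feats, slots, Wq, Wk, Wv, eps=1e-8):
    """One slot-attention iteration: slots compete to explain features."""
    d = slots.shape[1]
    q = slots @ Wq                       # (S, d) slot queries
    k = feats @ Wk                       # (N, d) feature keys
    v = feats @ Wv                       # (N, d) feature values
    logits = q @ k.T / np.sqrt(d)        # (S, N)
    # Softmax over the SLOT axis: each feature distributes itself
    # across slots, forcing slots to specialize.
    attn = np.exp(logits - logits.max(axis=0, keepdims=True))
    attn /= attn.sum(axis=0, keepdims=True)
    # Weighted mean of values per slot (a plain mean replaces the
    # GRU update used in real slot attention, to keep the sketch short).
    w = attn / (attn.sum(axis=1, keepdims=True) + eps)
    return w @ v                         # updated slots, (S, d)

rng = np.random.default_rng(0)
d, S, N = 16, 4, 64
feats = rng.normal(size=(N, d))          # flattened image features
slots = rng.normal(size=(S, d))          # randomly initialized slots
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
for _ in range(3):                       # a few refinement iterations
    slots = slot_attention_step(feats, slots, Wq, Wk, Wv)
print(slots.shape)  # (4, 16)
```

The spatial broadcast decoder then tiles each slot over a spatial grid and decodes it to an image plus an alpha mask, so reconstruction loss directly rewards per-object specialization.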

Abstract

World models that predict future states from video remain limited by flat latent representations that entangle objects, ignore causal structure, and collapse temporal dynamics into a single scale. We present HCLSM, a world model architecture built on three interconnected principles: object-centric decomposition via slot attention with spatial broadcast decoding; hierarchical temporal dynamics through a three-level engine combining selective state space models for continuous physics, sparse transformers for discrete events, and compressed transformers for abstract goals; and causal structure learning through graph neural network interaction patterns. HCLSM introduces a two-stage training protocol in which spatial reconstruction forces slot specialization before dynamics prediction begins. We train a 68M-parameter model on the PushT robotic manipulation benchmark from the Open X-Embodiment dataset, achieving 0.008 MSE next-state prediction loss with emergent spatial decomposition (SBD loss: 0.0075) and learned event boundaries. A custom Triton kernel for the SSM scan delivers a 38× speedup over sequential PyTorch. The full system spans 8,478 lines of Python across 51 modules with 171 unit tests. Code: https://github.com/rightnow-ai/hclsm
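
The reported Triton speedup concerns the SSM scan, the recurrence h_t = a_t ⊙ h_{t-1} + b_t. This looks inherently sequential, but step composition is associative, which is what parallel-scan GPU kernels exploit. A NumPy sketch of the idea (a stand-in for illustration, not the paper's Triton kernel):

```python
import numpy as np

def ssm_scan_sequential(a, b):
    """Reference: h_t = a_t * h_{t-1} + b_t, one step at a time."""
    h = np.zeros_like(b[0])
    out = np.empty_like(b)
    for t in range(len(b)):
        h = a[t] * h + b[t]
        out[t] = h
    return out

def ssm_scan_parallel(a, b):
    """Same recurrence via a Hillis-Steele associative scan.
    Composing h -> a1*h + b1 with h -> a2*h + b2 gives
    h -> (a1*a2)*h + (a2*b1 + b2), so the combine is associative
    and a GPU kernel can finish in O(log T) passes."""
    a, b = a.copy(), b.copy()
    step = 1
    while step < len(a):
        a_prev, b_prev = a[:-step].copy(), b[:-step].copy()
        b[step:] = a[step:] * b_prev + b[step:]
        a[step:] = a[step:] * a_prev
        step *= 2
    return b  # with h_0 = 0, the accumulated b is the output

rng = np.random.default_rng(0)
T, d = 128, 8
a = rng.uniform(0.5, 1.0, size=(T, d))   # per-step decay factors
b = rng.normal(size=(T, d))              # input contributions (B_t * x_t)
assert np.allclose(ssm_scan_sequential(a, b), ssm_scan_parallel(a, b))
```

The sequential loop mirrors a naive PyTorch implementation; the log-depth reformulation is the standard route to the kind of large kernel speedup the paper reports.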