StaMo: Unsupervised Learning of Generalizable Robot Motion from Compact State Representation

arXiv cs.RO / 4/14/2026

Key Points

  • StaMo proposes an unsupervised method that learns a highly compressed two-token state representation using a lightweight encoder and a pre-trained Diffusion Transformer (DiT) decoder (a minimal sketch of this pipeline follows the list).
  • The compressed representation plugs easily into existing VLA (Vision-Language-Action) models, with reported gains including a 14.3% improvement on LIBERO and a 30% increase in real-world task success rates.
  • The difference between the two tokens, obtained via latent interpolation, naturally functions as a "latent action" that can in turn be decoded into executable robot actions.
  • These latent actions are also effective for policy co-training, outperforming existing methods by 10.4% while improving interpretability.
  • StaMo encodes state representations from static images; in contrast to existing approaches that tend to depend on video and complex architectures, it scales to diverse data sources such as real robot data, simulation, and human egocentric video.
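To make the pipeline in the first bullet concrete, here is a minimal PyTorch sketch: a small encoder squeezes one RGB frame into two latent tokens, and (during training) a frozen pre-trained DiT decoder would reconstruct the image from them. The conv stem, token dimension of 768, pooling, and all module names are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class LightweightEncoder(nn.Module):
    """Compresses one RGB observation into a 2-token latent state (hypothetical design)."""
    def __init__(self, token_dim: int = 768, num_tokens: int = 2):
        super().__init__()
        self.backbone = nn.Sequential(                    # small conv stem, stand-in only
            nn.Conv2d(3, 64, kernel_size=4, stride=4), nn.GELU(),
            nn.Conv2d(64, 256, kernel_size=4, stride=4), nn.GELU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.to_tokens = nn.Linear(256, num_tokens * token_dim)
        self.num_tokens, self.token_dim = num_tokens, token_dim

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        flat = self.to_tokens(self.backbone(image))
        return flat.view(-1, self.num_tokens, self.token_dim)   # (B, 2, D)

encoder = LightweightEncoder()
image = torch.randn(1, 3, 224, 224)    # a single static RGB frame
state = encoder(image)                 # compact two-token state: (1, 2, 768)
# In training, a frozen pre-trained DiT decoder would reconstruct the image
# from these two tokens, providing the unsupervised learning signal.
```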

Abstract

A fundamental challenge in embodied intelligence is developing expressive and compact state representations for efficient world modeling and decision making. However, existing methods often fail to achieve this balance, yielding representations that are either overly redundant or lacking in task-critical information. We propose an unsupervised approach that learns a highly compressed two-token state representation using a lightweight encoder and a pre-trained Diffusion Transformer (DiT) decoder, capitalizing on the decoder's strong generative prior. Our representation is efficient, interpretable, and integrates seamlessly into existing VLA-based models, improving performance by 14.3% on LIBERO and 30% in real-world task success with minimal inference overhead. More importantly, we find that the difference between these tokens, obtained via latent interpolation, naturally serves as a highly effective latent action, which can be further decoded into executable robot actions. This emergent capability reveals that our representation captures structured dynamics without explicit supervision. We name our method StaMo for its ability to learn generalizable robotic Motion from compact State representation, encoded from static images, challenging the prevalent reliance on complex architectures and video data for learning latent actions. The resulting latent actions also enhance policy co-training, outperforming prior methods by 10.4% with improved interpretability. Moreover, our approach scales effectively across diverse data sources, including real-world robot data, simulation, and human egocentric video.
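The emergent latent action described in the abstract can be pictured as a token-space difference plus a small decoding head. The sketch below is a hedged illustration: the subtraction, interpolation coefficient, MLP head, and 7-dimensional action space are all assumptions for exposition, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

def latent_action(state_t: torch.Tensor, state_next: torch.Tensor,
                  alpha: float = 1.0) -> torch.Tensor:
    """Latent action as a scaled token-space difference (assumed formulation).

    alpha = 1.0 recovers the plain difference state_next - state_t; a fractional
    alpha corresponds to interpolating part-way between the two latent states.
    """
    return alpha * (state_next - state_t)

class ActionDecoder(nn.Module):
    """Hypothetical head mapping a latent action to an executable robot action."""
    def __init__(self, token_dim: int = 768, num_tokens: int = 2, action_dim: int = 7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(num_tokens * token_dim, 512), nn.GELU(),
            nn.Linear(512, action_dim),              # e.g. 6-DoF delta pose + gripper
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.mlp(z.flatten(start_dim=1))

z = latent_action(torch.randn(1, 2, 768), torch.randn(1, 2, 768))
action = ActionDecoder()(z)   # (1, 7) executable action
```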
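Finally, the policy co-training claim suggests a joint objective in which batches with ground-truth robot actions contribute a behavior-cloning loss, while actionless sources (simulation rollouts, human egocentric video) are supervised through latent actions only. A minimal sketch, assuming MSE losses and a fixed mixing weight (both illustrative, not the authors' choices):

```python
import torch
import torch.nn.functional as F

def cotrain_loss(pred_action: torch.Tensor, true_action: torch.Tensor,
                 pred_latent: torch.Tensor, target_latent: torch.Tensor,
                 latent_weight: float = 0.5) -> torch.Tensor:
    """Joint objective mixing real-action and latent-action supervision (illustrative)."""
    bc_loss = F.mse_loss(pred_action, true_action)          # labeled robot data
    latent_loss = F.mse_loss(pred_latent, target_latent)    # actionless video data
    return bc_loss + latent_weight * latent_loss
```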