STARRY: Spatial-Temporal Action-Centric World Modeling for Robotic Manipulation

arXiv cs.RO · April 30, 2026

📰 News · Models & Research

Key Points

  • The paper introduces STARRY, a world-model-enhanced action-generation policy for robotic manipulation that better captures action-relevant spatial-temporal interactions.
  • STARRY jointly denoises future spatial-temporal latent variables and action sequences, linking spatial-temporal prediction directly to action generation.
  • It adds Geometry-Aware Selective Attention Modulation that converts predicted depth and end-effector geometry into token-aligned weights to guide selective action attention.
  • Experiments on RoboTwin 2.0 show strong gains, including 93.82% and 93.30% average success under the Clean and Randomized settings respectively; in real-world experiments, average success improves from 42.5% with the \(\pi_{0.5}\) baseline to 70.8% with STARRY.
  • Overall, the results suggest that action-centric spatial-temporal world modeling can substantially improve robot performance on tasks requiring precise spatial-temporal reasoning.
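The summary above does not spell out how Geometry-Aware Selective Attention Modulation works internally, so the following is only a minimal sketch of the general idea: per-token geometry cues (here, a hypothetical Gaussian affinity between each token's predicted depth and the end-effector depth) are turned into token-aligned weights and folded into the attention logits as an additive log-bias. The function names, the Gaussian form, and the `sigma` parameter are all assumptions for illustration, not the paper's formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def geometry_modulated_attention(q, k, v, token_depth, ee_depth, sigma=0.1):
    """Scaled dot-product attention whose logits are biased toward tokens
    whose predicted depth lies close to the end-effector depth.

    q: (Tq, d) queries; k, v: (Tk, d) keys/values;
    token_depth: (Tk,) predicted depth per visual token; ee_depth: scalar.
    The Gaussian depth affinity is an illustrative assumption.
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)                                    # (Tq, Tk)
    # Token-aligned geometry weights (assumed form): high near the gripper.
    w = np.exp(-((token_depth - ee_depth) ** 2) / (2 * sigma ** 2))  # (Tk,)
    # Adding log-weights to the logits multiplies the attention probabilities.
    logits = logits + np.log(w + 1e-8)
    attn = softmax(logits, axis=-1)
    return attn @ v, attn
```

With this construction, tokens whose predicted depth matches the end-effector's receive proportionally more attention mass, which is one plausible way to make action attention "selective" over geometry.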

Abstract

Robotic manipulation critically requires reasoning about future spatial-temporal interactions, yet existing VLA policies and world-model-enhanced policies do not fully model action-relevant spatial-temporal interaction structure. We propose STARRY, a world-model-enhanced action-generation policy that aligns spatial-temporal prediction with action generation. STARRY jointly denoises future spatial-temporal latents and action sequences, and introduces Geometry-Aware Selective Attention Modulation to convert predicted depth and end-effector geometry into token-aligned weights for selective action-attention modulation. On RoboTwin 2.0, STARRY achieves 93.82% / 93.30% average success under the Clean and Randomized settings. In real-world experiments, STARRY further improves average success over \(\pi_{0.5}\), from 42.5% to 70.8%, demonstrating the effectiveness of action-centric spatial-temporal world modeling for spatial-temporally demanding robotic action generation.
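The abstract states that future spatial-temporal latents and action sequences are denoised jointly, but gives no sampler details. As a hedged illustration only, the coupling can be sketched as a single denoising process over the concatenation of the two: one model predicts an update for latents and actions together, so latent prediction directly shapes action generation. The Euler / flow-matching-style update, the `model` interface, and all names here are assumptions, not the paper's method.

```python
import numpy as np

def joint_denoise_step(model, z_t, a_t, t, dt):
    """One Euler step of an assumed flow-matching-style sampler that denoises
    the future spatial-temporal latent z and the action sequence a together.

    model(x, t) is assumed to predict a velocity for the concatenated
    vector x = [z, a], so both parts share one denoising trajectory.
    """
    nz = z_t.shape[0]
    x = np.concatenate([z_t, a_t])
    v = model(x, t)            # joint velocity over latents and actions
    x = x + dt * v             # Euler update toward the clean sample
    return x[:nz], x[nz:]
```

Because the velocity is predicted from the concatenated vector, the action update at each step can depend on the current latent estimate (and vice versa), which is the sense in which spatial-temporal prediction is linked to action generation.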