ST-$\pi$: Structured SpatioTemporal VLA for Robotic Manipulation

arXiv cs.RO / 4/21/2026


Key Points

  • The paper introduces ST-$\pi$, a new vision-language-action (VLA) model aimed at improving fine-grained spatiotemporal reasoning for robotic manipulation.
  • ST-$\pi$ uses a spatiotemporal VLM that encodes 4D observations and task instructions, then relies on an LLM to produce causally ordered chunk-level action prompts with spatial and temporal grounding.
  • It also adds a spatiotemporal action expert that employs a structured dual-generator guidance scheme to jointly model spatial dependencies and temporal causality for step-level action parameter prediction.
  • To support training and adaptation, the authors release a real-world robotics dataset with structured spatiotemporal annotations and provide code via the linked GitHub repository.
  • Experiments reported in the work indicate that this explicit, structured spatiotemporal planning plus local control refinement improves performance on manipulation tasks compared with prior approaches that leave such reasoning more implicit.
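The two-stage design above — a VLM that plans causally ordered chunk-level prompts, and an action expert that expands each prompt into step-level actions — can be sketched in Python. This is only an illustrative outline of the control flow described in the paper; the class names, data layout (e.g. a 3D target point for spatial grounding, a step interval for temporal grounding), and stubbed planner outputs are assumptions, not the authors' actual API.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ChunkPrompt:
    """Chunk-level action prompt: a sub-task plus its spatial and temporal grounding."""
    sub_task: str
    spatial_grounding: Tuple[float, float, float]  # illustrative: a target point (x, y, z)
    temporal_grounding: Tuple[int, int]            # illustrative: (start_step, end_step)

def plan_chunks(observation_4d, instruction: str) -> List[ChunkPrompt]:
    """Stage 1 (spatiotemporal VLM, stubbed): encode 4D observations and the
    instruction, then emit a causally ordered sequence of chunk-level prompts."""
    return [
        ChunkPrompt("reach cup", (0.4, 0.1, 0.2), (0, 10)),
        ChunkPrompt("grasp cup", (0.4, 0.1, 0.2), (10, 15)),
        ChunkPrompt("place cup on shelf", (0.2, 0.5, 0.4), (15, 30)),
    ]

def refine_steps(chunk: ChunkPrompt) -> List[dict]:
    """Stage 2 (spatiotemporal action expert, stubbed): expand one chunk into
    step-level action parameters within its temporal window."""
    start, end = chunk.temporal_grounding
    return [{"step": t, "target": chunk.spatial_grounding} for t in range(start, end)]

def run_pipeline(observation_4d, instruction: str) -> List[dict]:
    """Global plan from the VLM, then local refinement by the action expert."""
    steps: List[dict] = []
    for chunk in plan_chunks(observation_4d, instruction):
        steps.extend(refine_steps(chunk))
    return steps

actions = run_pipeline(observation_4d=None, instruction="put the cup on the shelf")
print(len(actions))  # 30 step-level actions covering the three chunks
```

The key property the sketch preserves is that the chunk boundaries (temporal grounding) are explicit, so sequential behaviors do not blur together — the criticism the paper levels at methods that keep spatiotemporal reasoning implicit.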

Abstract

Vision-language-action (VLA) models have achieved great success on general robotic tasks, but still struggle with fine-grained spatiotemporal manipulation. Existing methods mainly embed spatiotemporal knowledge into visual and action representations and directly perform a cross-modal mapping for step-level action prediction. However, such spatiotemporal reasoning remains largely implicit, making it difficult to handle multiple sequential behaviors with explicit spatiotemporal boundaries. In this work, we propose ST-$\pi$, a structured spatiotemporal VLA model for robotic manipulation. Our model is guided by two key designs: 1) Spatiotemporal VLM. We encode 4D observations and task instructions into latent spaces and feed them into the LLM to generate a sequence of causally ordered chunk-level action prompts consisting of sub-tasks, spatial grounding, and temporal grounding. 2) Spatiotemporal action expert. Conditioned on chunk-level action prompts, we design a structured dual-generator guidance scheme to jointly model spatial dependencies and temporal causality, thus predicting step-level action parameters. Within this structured framework, the VLM explicitly plans global spatiotemporal behavior, and the action expert further refines local spatiotemporal control. In addition, we propose a real-world robotic dataset with structured spatiotemporal annotations for fine-tuning. Extensive experiments demonstrate the effectiveness of our model. Code: https://github.com/chuanhaoma/ST-pi.