ST-$\pi$: Structured SpatioTemporal VLA for Robotic Manipulation

arXiv cs.RO / 4/21/2026


Key Points

  • The paper introduces ST-$\pi$, a new vision-language-action (VLA) model aimed at improving fine-grained spatiotemporal reasoning for robotic manipulation.
  • ST-$\pi$ uses a spatiotemporal VLM that encodes 4D observations and task instructions, then relies on an LLM to produce causally ordered chunk-level action prompts with spatial and temporal grounding.
  • It also adds a spatiotemporal action expert that employs a structured dual-generator guidance scheme to jointly model spatial dependencies and temporal causality for step-level action parameter prediction.
  • To support training and adaptation, the authors release a real-world robotics dataset with structured spatiotemporal annotations and provide code via the linked GitHub repository.
  • Experiments reported in the work indicate that this explicit, structured spatiotemporal planning plus local control refinement improves performance on manipulation tasks compared with prior approaches that leave such reasoning more implicit.
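The two-stage design above — a VLM that plans causally ordered chunk-level prompts, and an action expert that expands each prompt into step-level actions — can be sketched in Python. This is only an illustrative outline of the control flow described in the paper; the class names, data layout (e.g. a 3D target point for spatial grounding, a step interval for temporal grounding), and stubbed planner outputs are assumptions, not the authors' actual API.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ChunkPrompt:
    """Chunk-level action prompt: a sub-task plus its spatial and temporal grounding."""
    sub_task: str
    spatial_grounding: Tuple[float, float, float]  # illustrative: a target point (x, y, z)
    temporal_grounding: Tuple[int, int]            # illustrative: (start_step, end_step)

def plan_chunks(observation_4d, instruction: str) -> List[ChunkPrompt]:
    """Stage 1 (spatiotemporal VLM, stubbed): encode 4D observations and the
    instruction, then emit a causally ordered sequence of chunk-level prompts."""
    return [
        ChunkPrompt("reach cup", (0.4, 0.1, 0.2), (0, 10)),
        ChunkPrompt("grasp cup", (0.4, 0.1, 0.2), (10, 15)),
        ChunkPrompt("place cup on shelf", (0.2, 0.5, 0.4), (15, 30)),
    ]

def refine_steps(chunk: ChunkPrompt) -> List[dict]:
    """Stage 2 (spatiotemporal action expert, stubbed): expand one chunk into
    step-level action parameters within its temporal window."""
    start, end = chunk.temporal_grounding
    return [{"step": t, "target": chunk.spatial_grounding} for t in range(start, end)]

def run_pipeline(observation_4d, instruction: str) -> List[dict]:
    """Global plan from the VLM, then local refinement by the action expert."""
    steps: List[dict] = []
    for chunk in plan_chunks(observation_4d, instruction):
        steps.extend(refine_steps(chunk))
    return steps

actions = run_pipeline(observation_4d=None, instruction="put the cup on the shelf")
print(len(actions))  # 30 step-level actions covering the three chunks
```

The key property the sketch preserves is that the chunk boundaries (temporal grounding) are explicit, so sequential behaviors do not blur together — the criticism the paper levels at methods that keep spatiotemporal reasoning implicit.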

Abstract

Vision-language-action (VLA) models have achieved great success on general robotic tasks, but still struggle with fine-grained spatiotemporal manipulation. Existing methods mainly embed spatiotemporal knowledge into visual and action representations and directly perform a cross-modal mapping for step-level action prediction. However, such spatiotemporal reasoning remains largely implicit, making it difficult to handle multiple sequential behaviors with explicit spatiotemporal boundaries. In this work, we propose ST-$\pi$, a structured spatiotemporal VLA model for robotic manipulation. Our model is guided by two key designs: 1) Spatiotemporal VLM. We encode 4D observations and task instructions into latent spaces and feed them into the LLM to generate a sequence of causally ordered chunk-level action prompts consisting of sub-tasks, spatial grounding, and temporal grounding. 2) Spatiotemporal action expert. Conditioned on chunk-level action prompts, we design a structured dual-generator guidance scheme to jointly model spatial dependencies and temporal causality, thus predicting step-level action parameters. Within this structured framework, the VLM explicitly plans global spatiotemporal behavior, and the action expert further refines local spatiotemporal control. In addition, we propose a real-world robotic dataset with structured spatiotemporal annotations for fine-tuning. Extensive experiments demonstrate the effectiveness of our model. Code: https://github.com/chuanhaoma/ST-pi.