π0.7: A Steerable Generalist Robotic Foundation Model with Emergent Capabilities

arXiv cs.LG / April 20, 2026


Key Points

  • The paper introduces pi0.7, a steerable robotic foundation model designed to perform well out of the box across many scenarios without needing task-specific retraining.
  • pi0.7 can follow complex language instructions in unseen environments, including multi-stage kitchen tasks, and it demonstrates zero-shot cross-embodiment generalization (e.g., folding laundry without prior exposure).
  • The model matches the performance of more specialized reinforcement-learning fine-tuned models on challenging tasks such as operating an espresso machine in a zero-shot setting.
  • Its core approach is “diverse context conditioning” during training, where prompts include not only language goals but also additional multimodal steering signals (e.g., performance metadata and subgoal images) that encode strategies; a schematic sketch follows this list.
  • Training leverages a wide range of data sources, including demonstrations, possibly suboptimal autonomous/failure data, and data collected outside of robotics, and is evaluated across multiple robot platforms and task types.
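
To make the “diverse context conditioning” idea concrete, here is a minimal Python sketch of what such a multimodal prompt might look like. The paper does not publish an API, so every class, field, and function name below (`SteeringContext`, `build_prompt`, `quality_label`, etc.) is hypothetical; the sketch only illustrates how a language goal could be bundled with the extra steering signals the abstract names (performance metadata and subgoal images).

```python
"""Illustrative sketch of diverse context conditioning, as described in the
pi0.7 abstract. All names here are hypothetical, not the paper's API."""

from dataclasses import dataclass
from typing import Optional

import numpy as np


@dataclass
class SteeringContext:
    """Hypothetical container for the conditioning signals named in the paper."""
    instruction: str                             # language goal: *what* to do
    quality_label: Optional[str] = None          # performance metadata, e.g. "success" / "failure"
    subgoal_image: Optional[np.ndarray] = None   # HxWx3 frame encoding an intermediate target
    embodiment: Optional[str] = None             # robot-platform tag for cross-embodiment data


def build_prompt(ctx: SteeringContext) -> dict:
    """Flatten the steering context into one multimodal prompt.

    At training time, filling these fields from demonstrations, autonomous
    rollouts (including failures), and non-robot data lets a single policy
    absorb heterogeneous sources; at inference, setting e.g.
    quality_label="success" steers the model toward the desired strategy.
    """
    prompt = {"instruction": ctx.instruction}
    if ctx.quality_label is not None:
        prompt["quality"] = ctx.quality_label
    if ctx.subgoal_image is not None:
        prompt["subgoal_image"] = ctx.subgoal_image
    if ctx.embodiment is not None:
        prompt["embodiment"] = ctx.embodiment
    return prompt


# Example: condition on a kitchen task plus a subgoal image and a quality tag.
ctx = SteeringContext(
    instruction="load the espresso machine and pull a shot",
    quality_label="success",
    subgoal_image=np.zeros((224, 224, 3), dtype=np.uint8),  # placeholder frame
    embodiment="bimanual-arm",
)
print(sorted(build_prompt(ctx).keys()))
```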

Abstract

We present a new robotic foundation model, π0.7, that achieves strong out-of-the-box performance in a wide range of scenarios. π0.7 can follow diverse language instructions in unseen environments, including multi-stage tasks involving various kitchen appliances; it generalizes zero-shot across embodiments, for example enabling a robot to fold laundry without having seen the task before; and it performs challenging tasks, such as operating an espresso machine, out of the box at a level of performance that matches much more specialized RL-finetuned models. The main idea behind π0.7 is diverse context conditioning during training. This conditioning information, carried in the prompt, makes it possible to steer the model precisely to perform many tasks with different strategies: the model is conditioned not just on a language command describing what it should do, but also on additional multimodal information describing the manner or strategy in which it should do it, including metadata about task performance and subgoal images. This enables π0.7 to use very diverse data, including demonstrations, potentially suboptimal autonomous data (including failures), and data from non-robot sources. Our experiments evaluate π0.7 across numerous tasks on multiple robot platforms, covering tasks that require speed and dexterity, language following, and compositional task generalization.
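
The abstract's claim that conditioning on performance metadata lets the model learn from suboptimal and failed episodes echoes outcome-conditioned (reward-conditioned) policy learning in general. The sketch below shows that mechanism under stated assumptions; it is not π0.7's actual implementation, which the abstract does not specify, and the `episode` schema and outcome labels are invented for illustration.

```python
"""Minimal sketch of outcome-conditioned training: every episode, including
failures, becomes usable because the outcome tag tells the model which
strategy the trajectory illustrates. Hypothetical schema, not pi0.7's code."""


def make_training_example(episode: dict) -> dict:
    # Tag each trajectory with its observed outcome so failures are
    # signal (examples of the "failure" strategy), not noise.
    return {
        "prompt": {
            "instruction": episode["instruction"],
            "outcome": "success" if episode["succeeded"] else "failure",
        },
        "actions": episode["actions"],
    }


def make_inference_prompt(instruction: str) -> dict:
    # At deployment, always request the successful strategy.
    return {"instruction": instruction, "outcome": "success"}


# Example: a failed autonomous rollout still yields a valid training example.
episode = {"instruction": "fold the towel", "succeeded": False, "actions": [...]}
print(make_training_example(episode)["prompt"])
print(make_inference_prompt("fold the towel"))
```

The design point this illustrates is the one the abstract emphasizes: because strategy and quality live in the prompt rather than being baked into the policy, the same model can train on demonstrations, autonomous failures, and non-robot data, then be steered toward the desired behavior at test time.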