Abstract
We present a new robotic foundation model, {\pi}_{0.7}, that delivers strong out-of-the-box performance in a wide range of scenarios. {\pi}_{0.7} can follow diverse language instructions in unseen environments, including multi-stage tasks involving various kitchen appliances; generalize zero-shot across embodiments, for example enabling a robot to fold laundry without having seen the task on that platform before; and perform challenging tasks, such as operating an espresso machine out of the box, at a level of performance that matches much more specialized RL-finetuned models. The main idea behind {\pi}_{0.7} is diverse context conditioning during training. This conditioning information, provided in the prompt, makes it possible to steer the model precisely toward many tasks with different strategies: the model is conditioned not only on a language command that describes what it should do, but also on additional multimodal information that describes the manner or strategy in which it should do it, including metadata about task performance and subgoal images. This enables {\pi}_{0.7} to learn from very diverse data, including demonstrations, potentially suboptimal autonomous data (including failures), and data from non-robot sources. Our experiments evaluate {\pi}_{0.7} on numerous tasks across multiple robot platforms, covering tasks that require speed and dexterity, language following, and compositional task generalization.