JoyAI-RA 0.1: A Foundation Model for Robotic Autonomy

arXiv cs.RO · April 23, 2026


Key Points

  • The paper introduces JoyAI-RA 0.1, a vision-language-action (VLA) embodied foundation model designed to improve robotic autonomy in open-world settings.
  • It targets key limitations of prior work, including insufficient diversity in training data and weak generalization across different robot embodiments.
  • JoyAI-RA uses a multi-source, multi-level pretraining approach that combines web data, large-scale egocentric human manipulation videos, simulation trajectory data, and real-robot data.
  • The model includes explicit action-space unification to bridge embodiment gaps, particularly between human manipulation behaviors and robotic control, improving transfer of learned behaviors (see the sketch after this list).
  • The authors report that JoyAI-RA outperforms state-of-the-art methods on both simulation and real-world benchmarks, especially for diverse tasks requiring generalization.
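
This summary does not spell out how the action-space unification works, so the following is only a minimal sketch of one common approach: rescaling each embodiment's native action range into a shared normalized space, so that a single policy head can be supervised by both human-video and robot data. The embodiment names and bounds (`human_hand`, `franka_arm`, `ACTION_BOUNDS`) are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Per-embodiment action bounds for a 7-DoF action
# (xyz translation delta, rpy rotation delta, gripper opening).
# These names and numbers are illustrative placeholders, not
# values taken from the paper.
ACTION_BOUNDS = {
    "human_hand": (np.array([-0.05] * 6 + [0.0]), np.array([0.05] * 6 + [1.0])),
    "franka_arm": (np.array([-0.02] * 6 + [0.0]), np.array([0.02] * 6 + [1.0])),
}

def to_unified(action: np.ndarray, embodiment: str) -> np.ndarray:
    """Rescale an embodiment-specific action into a shared [-1, 1] space."""
    low, high = ACTION_BOUNDS[embodiment]
    return 2.0 * (action - low) / (high - low) - 1.0

def from_unified(unified: np.ndarray, embodiment: str) -> np.ndarray:
    """Decode a unified action back into a target embodiment's range."""
    low, high = ACTION_BOUNDS[embodiment]
    return low + (unified + 1.0) / 2.0 * (high - low)

# A human-video action and a robot action land in the same normalized
# space, so one policy head can learn from both data sources.
human_act = np.array([0.03, -0.01, 0.02, 0.0, 0.0, 0.01, 1.0])
shared = to_unified(human_act, "human_hand")
robot_act = from_unified(shared, "franka_arm")
print(shared, robot_act)
```

The design point is that a demonstration recorded on one body can be replayed, at the right scale, on another: the normalized action is embodiment-agnostic, and only the encode/decode step knows about physical limits.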

Abstract

Robotic autonomy in open-world environments is fundamentally limited by insufficient data diversity and poor cross-embodiment generalization. Existing robotic datasets are often limited in scale and task coverage, while large differences across robot embodiments impede effective transfer of behavioral knowledge. To address these challenges, we propose JoyAI-RA, a vision-language-action (VLA) embodied foundation model tailored for generalizable robotic manipulation. JoyAI-RA is trained with a multi-source, multi-level pretraining framework that integrates web data, large-scale egocentric human manipulation videos, simulation-generated trajectories, and real-robot data. By training on heterogeneous multi-source data with explicit action-space unification, JoyAI-RA effectively bridges embodiment gaps, particularly between human manipulation and robotic control, thereby enhancing cross-embodiment behavior learning. JoyAI-RA outperforms state-of-the-art methods on both simulation and real-world benchmarks, especially on diverse tasks that demand generalization.
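
As a rough illustration of how such a multi-source pretraining mix might be assembled, the snippet below samples training examples from the four data sources at fixed proportions. The weights in `SOURCE_WEIGHTS` are assumed placeholders; the paper's actual mixing ratios are not given in this summary.

```python
import random

# Illustrative mixing weights for the four pretraining sources;
# the actual ratios used by JoyAI-RA are not stated here, so these
# numbers are placeholders.
SOURCE_WEIGHTS = {
    "web": 0.30,          # image-text pairs for broad visual grounding
    "human_video": 0.35,  # egocentric human manipulation clips
    "simulation": 0.20,   # simulation-generated trajectories
    "real_robot": 0.15,   # real-robot demonstration data
}

def sample_sources(n: int, seed: int = 0) -> list[str]:
    """Draw source labels for n training examples, in proportion to the weights."""
    rng = random.Random(seed)
    names, weights = zip(*SOURCE_WEIGHTS.items())
    return rng.choices(names, weights=weights, k=n)

# Each pretraining batch mixes heterogeneous sources at fixed proportions.
print(sample_sources(8))
```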