StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing

arXiv cs.RO / 4/8/2026

Key Points

  • StarVLA is presented as an open-source “Lego-like” codebase aimed at making Vision-Language-Action (VLA) model research more modular, swappable, and reproducible.
  • It introduces a modular backbone/action-head architecture that supports both vision-language model (e.g., Qwen-VL) and world-model (e.g., Cosmos) backbones, with independent swapping of components.
  • The framework includes reusable training strategies such as cross-embodiment learning and multimodal co-training that are consistent across the supported VLA paradigms.
  • StarVLA unifies major VLA benchmarks (LIBERO, SimplerEnv, RoboTwin 2.0, RoboCasa-GR1, BEHAVIOR-1K) via a single evaluation interface covering both simulation and real-robot deployment.
  • The authors claim the provided single-benchmark training recipes are fully reproducible and can match or surpass prior methods on multiple benchmarks with both backbone types.
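The independent swapping of backbone and action head described in the key points can be illustrated with a minimal sketch. This is not StarVLA's API: every class and method name below (`Backbone`, `ActionHead`, `Policy`, the toy implementations) is hypothetical, and the "backbones" are trivial stand-ins rather than Qwen-VL or Cosmos. It only shows how a shared interface lets either component be replaced without touching the other.

```python
# Hypothetical sketch: none of these names come from the StarVLA codebase.
from dataclasses import dataclass
from typing import List, Protocol


class Backbone(Protocol):
    """Assumed interface: turn an observation and instruction into features."""
    def encode(self, image: List[float], instruction: str) -> List[float]: ...


class ActionHead(Protocol):
    """Assumed interface: turn backbone features into an action vector."""
    def decode(self, features: List[float]) -> List[float]: ...


class ToyVLMBackbone:
    """Stand-in for a VLM backbone: mean pixel value and instruction length."""
    def encode(self, image: List[float], instruction: str) -> List[float]:
        return [sum(image) / len(image), float(len(instruction))]


class ToyWorldModelBackbone:
    """Stand-in for a world-model backbone with a different feature size."""
    def encode(self, image: List[float], instruction: str) -> List[float]:
        return [max(image), min(image), float(len(instruction.split()))]


class RepeatMeanHead:
    """Emits the feature mean repeated action_dim times."""
    def __init__(self, action_dim: int) -> None:
        self.action_dim = action_dim

    def decode(self, features: List[float]) -> List[float]:
        mean = sum(features) / len(features)
        return [mean] * self.action_dim


class ClippedHead:
    """Clips the first action_dim features to [-1, 1], zero-padding if short."""
    def __init__(self, action_dim: int) -> None:
        self.action_dim = action_dim

    def decode(self, features: List[float]) -> List[float]:
        clipped = [max(-1.0, min(1.0, f)) for f in features[: self.action_dim]]
        return clipped + [0.0] * (self.action_dim - len(clipped))


@dataclass
class Policy:
    """Composes any backbone with any head through the two interfaces."""
    backbone: Backbone
    head: ActionHead

    def act(self, image: List[float], instruction: str) -> List[float]:
        return self.head.decode(self.backbone.encode(image, instruction))


# Swap either component independently of the other.
policy_a = Policy(ToyVLMBackbone(), RepeatMeanHead(action_dim=7))
policy_b = Policy(ToyWorldModelBackbone(), ClippedHead(action_dim=2))
```

Because `Policy` depends only on the two interfaces, pairing `ToyVLMBackbone` with `ClippedHead` (or any other combination) requires no change to either class.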

Abstract

Building generalist embodied agents requires integrating perception, language understanding, and action, which are core capabilities addressed by Vision-Language-Action (VLA) approaches based on multimodal foundation models, including recent advances in vision-language models and world models. Despite rapid progress, VLA methods remain fragmented across incompatible architectures, codebases, and evaluation protocols, hindering principled comparison and reproducibility. We present StarVLA, an open-source codebase for VLA research. StarVLA addresses these challenges in three aspects. First, it provides a modular backbone/action-head architecture that supports both VLM backbones (e.g., Qwen-VL) and world-model backbones (e.g., Cosmos) alongside representative action-decoding paradigms, all under a shared abstraction in which backbone and action head can each be swapped independently. Second, it provides reusable training strategies, including cross-embodiment learning and multimodal co-training, that apply consistently across supported paradigms. Third, it integrates major benchmarks, including LIBERO, SimplerEnv, RoboTwin 2.0, RoboCasa-GR1, and BEHAVIOR-1K, through a unified evaluation interface that supports both simulation and real-robot deployment. StarVLA also ships simple, fully reproducible single-benchmark training recipes that, despite minimal data engineering, already match or surpass prior methods on multiple benchmarks with both VLM and world-model backbones. To the best of our knowledge, StarVLA is one of the most comprehensive open-source VLA frameworks available, and we expect it to lower the barrier for reproducing existing methods and prototyping new ones. StarVLA is being actively maintained and expanded; we will update this report as the project evolves. The code and documentation are available at https://github.com/starVLA/starVLA.
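To make the idea of a unified evaluation interface concrete, the following is a minimal sketch under stated assumptions. It does not reproduce StarVLA's evaluation code or the APIs of LIBERO, SimplerEnv, or the other benchmarks: the `Env` protocol, the `ReachTargetEnv` toy task, the `evaluate` function, and the `BENCHMARKS` registry are all hypothetical names chosen for illustration. The point is only that one evaluation loop can serve any environment that implements the same `reset`/`step` contract.

```python
# Hypothetical sketch: the interface and task below are illustrative only.
from typing import Callable, Dict, List, Protocol, Tuple

Observation = List[float]
Action = List[float]
PolicyFn = Callable[[Observation], Action]


class Env(Protocol):
    """Assumed contract shared by every benchmark adapter."""
    def reset(self) -> Observation: ...
    def step(self, action: Action) -> Tuple[Observation, bool, bool]: ...


class ReachTargetEnv:
    """Toy task: succeed once the scalar state equals the target."""
    def __init__(self, target: float, max_steps: int = 10) -> None:
        self.target = target
        self.max_steps = max_steps
        self.state = 0.0
        self.steps = 0

    def reset(self) -> Observation:
        self.state = 0.0
        self.steps = 0
        return [self.state, self.target]

    def step(self, action: Action) -> Tuple[Observation, bool, bool]:
        self.state += action[0]
        self.steps += 1
        success = abs(self.state - self.target) < 1e-6
        done = success or self.steps >= self.max_steps
        return [self.state, self.target], done, success


def evaluate(policy: PolicyFn, env: Env, episodes: int) -> float:
    """Run the same loop on any Env and return the success rate."""
    successes = 0
    for _ in range(episodes):
        obs = env.reset()
        done = False
        success = False
        while not done:
            obs, done, success = env.step(policy(obs))
        successes += int(success)
    return successes / episodes


def greedy_policy(obs: Observation) -> Action:
    """Moves directly to the target in one step."""
    state, target = obs
    return [target - state]


# Registry mapping benchmark names to environment factories.
BENCHMARKS: Dict[str, Callable[[], Env]] = {
    "toy_near": lambda: ReachTargetEnv(target=1.0),
    "toy_far": lambda: ReachTargetEnv(target=5.0),
}

results = {name: evaluate(greedy_policy, make(), episodes=3)
           for name, make in BENCHMARKS.items()}
```

Under this design, supporting a new benchmark or a real robot would amount to writing one adapter that satisfies `Env` and registering it; the evaluation loop and the policy stay unchanged.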