StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing
arXiv cs.RO / 4/8/2026
📰 News · Developer Stack & Infrastructure · Signals & Early Trends · Models & Research
Key Points
- StarVLA is presented as an open-source “Lego-like” codebase aimed at making Vision-Language-Action (VLA) model research more modular, swappable, and reproducible.
- It introduces a modular backbone/action-head architecture that supports both vision-language-model backbones (e.g., Qwen-VL) and world-model backbones (e.g., Cosmos), with components that can be swapped independently (see the registry sketch after this list).
- The framework includes reusable training strategies, such as cross-embodiment learning and multimodal co-training, that are applied consistently across the supported VLA paradigms (see the co-training sketch below).
- StarVLA unifies major VLA benchmarks (LIBERO, SimplerEnv, RoboTwin 2.0, RoboCasa-GR1, BEHAVIOR-1K) behind a single evaluation interface covering both simulation and real-robot deployment (see the evaluation sketch below).
- The authors claim the provided single-benchmark training recipes are fully reproducible and can match or surpass prior methods on multiple benchmarks with both backbone types.
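The "Lego-like" claim centers on composing a policy from independently registered bricks. The following is a minimal sketch of how such a backbone/action-head registry could look; the names (`BACKBONES`, `ACTION_HEADS`, `build_vla`, and the placeholder `QwenVLBackbone` and `MLPActionHead` classes) are illustrative assumptions, not StarVLA's actual API.

```python
# Hypothetical "Lego-like" component registry for a VLA policy.
# All names here are assumptions for illustration, not the StarVLA codebase.
from dataclasses import dataclass
from typing import Callable, Dict

import torch
import torch.nn as nn

BACKBONES: Dict[str, Callable[..., nn.Module]] = {}
ACTION_HEADS: Dict[str, Callable[..., nn.Module]] = {}


def register(registry: Dict[str, Callable[..., nn.Module]], name: str):
    """Decorator that adds a component class to `registry` under `name`."""
    def wrap(cls):
        registry[name] = cls
        return cls
    return wrap


@register(BACKBONES, "qwen_vl")  # stand-in for a VLM backbone such as Qwen-VL
class QwenVLBackbone(nn.Module):
    def __init__(self, hidden_dim: int = 256):
        super().__init__()
        self.encoder = nn.Linear(512, hidden_dim)  # placeholder for the real encoder

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.encoder(obs)


@register(ACTION_HEADS, "mlp")  # stand-in for an action head (diffusion, flow, ...)
class MLPActionHead(nn.Module):
    def __init__(self, hidden_dim: int = 256, action_dim: int = 7):
        super().__init__()
        self.decoder = nn.Linear(hidden_dim, action_dim)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.decoder(features)


@dataclass
class VLAConfig:
    backbone: str = "qwen_vl"
    action_head: str = "mlp"


def build_vla(cfg: VLAConfig) -> nn.Module:
    """Compose a policy from independently swappable bricks."""
    return nn.Sequential(BACKBONES[cfg.backbone](), ACTION_HEADS[cfg.action_head]())


policy = build_vla(VLAConfig())
actions = policy(torch.randn(1, 512))  # -> action vector of shape (1, 7)
```

Under this kind of design, swapping a backbone or action head reduces to changing a string in the config, which is what would make per-benchmark recipes easy to mix and match.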
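Multimodal co-training typically means interleaving robot-action batches with vision-language batches within the same training run. Below is a small, hypothetical mixing sketch; the dataset names and the 0.7/0.3 ratio are assumptions for illustration, not the paper's recipe.

```python
# Hypothetical multimodal co-training stream: robot-action samples are
# interleaved with vision-language (e.g., VQA) samples at a fixed ratio.
import random
from typing import Iterator, List


def cotraining_stream(action_data: List[dict],
                      vl_data: List[dict],
                      action_ratio: float = 0.7,
                      seed: int = 0) -> Iterator[dict]:
    """Yield samples, drawing from robot-action data with probability
    `action_ratio` and from vision-language data otherwise."""
    rng = random.Random(seed)
    while True:
        source = action_data if rng.random() < action_ratio else vl_data
        yield rng.choice(source)


# Toy usage: two stand-in datasets tagged by modality.
actions = [{"type": "action", "traj": i} for i in range(100)]
vqa = [{"type": "vqa", "qa": i} for i in range(100)]
stream = cotraining_stream(actions, vqa)
batch = [next(stream) for _ in range(8)]  # mixed mini-batch
```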
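One way to read the unified-benchmark claim is that every suite, simulated or real, is adapted to one small evaluation surface. The sketch below assumes a hypothetical `Benchmark` protocol and `BENCHMARKS` registry; none of these names come from the StarVLA codebase.

```python
# Hypothetical single evaluation interface over multiple benchmark suites.
from abc import ABC, abstractmethod
from typing import Callable, Dict, List


class Benchmark(ABC):
    """Minimal surface every suite (simulated or real robot) is adapted to."""

    @abstractmethod
    def tasks(self) -> List[str]:
        ...

    @abstractmethod
    def run_episode(self, task: str, policy: Callable) -> bool:
        """Roll out `policy` on `task`; return True on success."""


BENCHMARKS: Dict[str, Callable[[], Benchmark]] = {}  # e.g. "libero", "simpler_env", ...


def evaluate(policy: Callable, name: str, episodes: int = 10) -> float:
    """Average success rate of `policy` over every task in one benchmark."""
    bench = BENCHMARKS[name]()
    successes, total = 0, 0
    for task in bench.tasks():
        for _ in range(episodes):
            successes += bench.run_episode(task, policy)
            total += 1
    return successes / total


class DummySim(Benchmark):
    """Toy stand-in for a simulated suite such as LIBERO."""

    def tasks(self) -> List[str]:
        return ["pick_cube"]

    def run_episode(self, task: str, policy: Callable) -> bool:
        return policy(task) == "grasp"


BENCHMARKS["dummy_sim"] = DummySim
print(evaluate(lambda task: "grasp", "dummy_sim"))  # -> 1.0
```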