VLA Foundry: A Unified Framework for Training Vision-Language-Action Models
arXiv cs.RO / 4/22/2026
Key Points
- VLA Foundry is an open-source training framework that unifies LLM, VLM, and VLA (vision-language-action) model training within a single codebase.
- It addresses fragmentation in prior open-source VLA efforts by providing an end-to-end, unified training stack covering language pretraining through action-expert fine-tuning.
- The framework supports both from-scratch training and workflows that start from pretrained Hugging Face backbones, including Qwen3-VL (see the sketch after this list for the general fine-tuning pattern).
- The authors demonstrate effectiveness by training and releasing two model variants (from-scratch via an LLM→VLM→VLA pipeline, and a Qwen3-VL–backbone variant) and evaluating them in closed-loop control on LBM Eval.
- Results show the from-scratch model matches prior closed-source performance in the nominal setting, while the Qwen3-VL-based model significantly improves multi-task tabletop manipulation performance over the baseline.
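To make the pretrained-backbone workflow concrete, the sketch below shows one common pattern for turning a pretrained vision-language backbone into a VLA policy: freeze the backbone and fine-tune a small "action expert" head with behavior cloning on (observation, action-chunk) pairs. All class, tensor, and parameter names here are hypothetical illustrations and do not reflect VLA Foundry's actual API, which the digest does not describe.

```python
# Hypothetical illustration only; not VLA Foundry's real interface.
# Assumes a pretrained VLM backbone that returns pooled features per observation,
# plus a dataset of (observation, action-chunk) pairs for behavior cloning.
import torch
import torch.nn as nn


class ActionExpertHead(nn.Module):
    """Maps backbone features to a chunk of continuous robot actions."""

    def __init__(self, feature_dim: int, action_dim: int, chunk_len: int):
        super().__init__()
        self.chunk_len = chunk_len
        self.action_dim = action_dim
        self.mlp = nn.Sequential(
            nn.Linear(feature_dim, 1024),
            nn.GELU(),
            nn.Linear(1024, chunk_len * action_dim),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, feature_dim) pooled from the frozen VLM backbone
        out = self.mlp(features)
        return out.view(-1, self.chunk_len, self.action_dim)


def finetune_step(backbone: nn.Module, head: ActionExpertHead,
                  batch: dict, optimizer: torch.optim.Optimizer) -> float:
    """One behavior-cloning step: backbone kept frozen, action head trained."""
    with torch.no_grad():                       # keep the pretrained VLM fixed
        features = backbone(batch["obs"])       # (batch, feature_dim)
    pred = head(features)                       # (batch, chunk_len, action_dim)
    loss = nn.functional.mse_loss(pred, batch["actions"])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    # Stand-in backbone; in practice this would be a pretrained VLM such as Qwen3-VL.
    backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512))
    head = ActionExpertHead(feature_dim=512, action_dim=7, chunk_len=16)
    opt = torch.optim.AdamW(head.parameters(), lr=1e-4)
    batch = {"obs": torch.randn(8, 3, 32, 32),
             "actions": torch.randn(8, 16, 7)}
    print("loss:", finetune_step(backbone, head, batch, opt))
```

Freezing the backbone and training only the action head is just one fine-tuning recipe; full or partial backbone fine-tuning is another, and the digest does not specify which approach VLA Foundry's action-expert stage uses.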