VLA Foundry: A Unified Framework for Training Vision-Language-Action Models

arXiv cs.RO / 4/22/2026


Key Points

  • VLA Foundry is an open-source training framework that unifies LLM, VLM, and VLA (vision-language-action) model training within a single codebase.
  • It addresses fragmentation in prior open-source VLA efforts by providing an end-to-end, unified training stack covering language pretraining through action-expert fine-tuning.
  • The framework supports both from-scratch training and workflows that start from pretrained Hugging Face backbones, including Qwen3-VL.
  • The authors demonstrate effectiveness by training and releasing two model variants (from-scratch via an LLM→VLM→VLA pipeline, and a Qwen3-VL–backbone variant) and evaluating them in closed-loop control on LBM Eval.
  • In the nominal evaluation setting, the from-scratch model is on par with the authors' prior closed-source work, while the Qwen3-VL-based model outperforms the baseline by a wide margin on multi-task tabletop manipulation.

Abstract

We present VLA Foundry, an open-source framework that unifies LLM, VLM, and VLA training in a single codebase. Most open-source VLA efforts specialize in the action training stage, often stitching together incompatible pretraining pipelines. VLA Foundry instead provides a shared training stack with end-to-end control, from language pretraining to action-expert fine-tuning. VLA Foundry supports both from-scratch training and pretrained backbones from Hugging Face. To demonstrate the utility of our framework, we train and release two types of models: the first trained fully from scratch through our LLM→VLM→VLA pipeline, and the second built on the pretrained Qwen3-VL backbone. We evaluate the closed-loop policy performance of both models on LBM Eval, an open-data, open-source simulator. We also contribute usability improvements to the simulator and the STEP analysis tools for easier public use. In the nominal evaluation setting, our fully open from-scratch model is on par with our prior closed-source work, and substituting in the Qwen3-VL backbone leads to a strong multi-task tabletop manipulation policy that outperforms our baseline by a wide margin. The VLA Foundry codebase is available at https://github.com/TRI-ML/vla_foundry, and all multi-task model weights are released at https://huggingface.co/collections/TRI-ML/vla-foundry. Additional qualitative videos are available on the project website, https://tri-ml.github.io/vla_foundry.
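
The two training routes described in the abstract can be pictured as an ordered list of stages, each initialized from the output of the one before it. The sketch below is purely illustrative and does not use the VLA Foundry API: the names `Stage`, `build_pipeline`, and `pretrained_vlm` are hypothetical, and the only facts taken from the abstract are the LLM→VLM→VLA ordering and the option of starting from a pretrained backbone such as Qwen3-VL.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Stage:
    """One training stage in a hypothetical staged pipeline (not the real API)."""
    name: str                 # "llm", "vlm", or "vla"
    init_from: Optional[str]  # None means random initialization


def build_pipeline(pretrained_vlm: Optional[str] = None) -> List[Stage]:
    """Return the ordered stages for one of the two routes in the abstract.

    With no pretrained backbone, every stage is trained in sequence and
    each stage starts from the previous stage's result. With a pretrained
    vision-language backbone, only the action (VLA) stage remains.
    """
    if pretrained_vlm is None:
        return [
            Stage("llm", None),    # language pretraining from scratch
            Stage("vlm", "llm"),   # vision-language stage starts from the LLM
            Stage("vla", "vlm"),   # action stage starts from the VLM
        ]
    # Assumption: the pretrained backbone stands in for the LLM and VLM stages.
    return [Stage("vla", pretrained_vlm)]


if __name__ == "__main__":
    for route in (build_pipeline(), build_pipeline("Qwen3-VL")):
        print([(s.name, s.init_from) for s in route])
```

Representing both routes with the same stage type mirrors the abstract's point about a shared training stack: the from-scratch and pretrained-backbone variants differ only in where the final stage gets its initial weights.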