OmniVLA-RL: A Vision-Language-Action Model with Spatial Understanding and Online RL

arXiv cs.RO / 4/21/2026


Key Points

  • The paper introduces OmniVLA-RL, a new vision-language-action (VLA) model designed to address known weaknesses of existing embodied-AI systems in spatial perception, multimodal fusion, and reinforcement learning stability.
  • It uses a Mix-of-Transformers (MoT) architecture that combines specialized “reasoning,” “spatial,” and “action” experts to better integrate multimodal information for action selection (a minimal sketch follows this list).
  • To improve action precision and training robustness, the authors propose Flow-GSPO, which reformulates flow matching as an SDE-based sampling process and combines it with Group Segmented Policy Optimization (GSPO); a sketch of this idea follows the abstract below.
  • Experiments on the LIBERO and LIBERO-Plus benchmarks show OmniVLA-RL outperforming prior state-of-the-art approaches, addressing core limitations of existing VLA models.
  • Overall, the work advances the design and training pipeline of VLA systems by coupling spatial understanding improvements with more stable online reinforcement learning.
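
The MoT design described above can be pictured as modality-specific expert feed-forward layers sharing a joint attention layer. The sketch below is a minimal, hypothetical PyTorch rendering of that idea, assuming per-modality token streams and one expert FFN each for “reasoning,” “spatial,” and “action”; the module names, dimensions, and routing scheme are illustrative assumptions, not the paper’s actual code.

```python
# Hypothetical Mix-of-Transformers (MoT) block: shared self-attention fuses
# all modality streams, then each stream is routed through its own expert FFN.
import torch
import torch.nn as nn

class MoTBlock(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # One expert FFN per modality: reasoning, spatial, action (assumed split).
        self.experts = nn.ModuleDict({
            name: nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for name in ("reasoning", "spatial", "action")
        })
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, tokens: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
        # Concatenate modality streams so attention fuses them jointly.
        names = list(tokens)
        lengths = [tokens[n].shape[1] for n in names]
        x = torch.cat([tokens[n] for n in names], dim=1)
        h = self.norm1(x)
        h, _ = self.attn(h, h, h)
        x = x + h
        # Route each stream back through its dedicated expert FFN.
        out, offset = {}, 0
        for n, length in zip(names, lengths):
            chunk = x[:, offset : offset + length]
            out[n] = chunk + self.experts[n](self.norm2(chunk))
            offset += length
        return out

block = MoTBlock()
streams = {n: torch.randn(2, 8, 256) for n in ("reasoning", "spatial", "action")}
fused = block(streams)
print({n: t.shape for n, t in fused.items()})
```

Run as-is, the block fuses three random token streams through shared attention and returns each stream, shape-preserved, after its own expert.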

Abstract

Vision-Language-Action (VLA) models represent a paradigm shift in embodied AI, yet existing frameworks often struggle with imprecise spatial perception, suboptimal multimodal fusion, and instability in reinforcement learning. To bridge these gaps, we propose OmniVLA-RL, a novel architecture that leverages a Mix-of-Transformers (MoT) design to synergistically integrate reasoning, spatial, and action experts. Furthermore, we introduce Flow-GSPO, which reformulates flow matching as a Stochastic Differential Equation (SDE) process and integrates it with Group Segmented Policy Optimization (GSPO) to enhance action precision and training robustness. Extensive evaluations on the LIBERO and LIBERO-Plus benchmarks demonstrate that OmniVLA-RL significantly outperforms state-of-the-art methods, effectively overcoming the fundamental limitations of current VLA models.
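
One way to read the abstract’s Flow-GSPO reformulation: replacing the deterministic flow-matching ODE with an SDE makes each integration step a Gaussian policy with a well-defined log-probability, which can then feed a clipped, group-relative policy-gradient objective. The sketch below illustrates that reading only; `toy_velocity`, the noise schedule, and the GRPO/GSPO-style loss are assumptions made for the example, not the paper’s formulation.

```python
# Hedged sketch: flow matching sampled as an SDE (Euler-Maruyama) plus a
# GSPO-style grouped, clipped policy-gradient loss. Names are illustrative.
import torch

def toy_velocity(x, t):
    # Placeholder drift standing in for the learned velocity field v(x, t).
    return -x

def sde_sample(velocity_net, x0, n_steps=10, sigma=0.1):
    """Integrate dx = v(x, t) dt + sigma dW from t=0 to t=1, logging step log-probs."""
    x, dt = x0, 1.0 / n_steps
    log_probs = []
    for i in range(n_steps):
        t = torch.full(x.shape[:1], i * dt)
        mean = x + velocity_net(x, t) * dt        # deterministic drift step
        std = sigma * dt ** 0.5                   # Euler-Maruyama noise scale
        x_next = mean + std * torch.randn_like(x)
        # Each step is a Gaussian policy; sum log-prob over action dimensions.
        log_probs.append(torch.distributions.Normal(mean, std).log_prob(x_next).sum(-1))
        x = x_next
    return x, torch.stack(log_probs, dim=-1)      # final actions, per-step log-probs

def gspo_style_loss(logp_new, logp_old, rewards, clip_eps=0.2):
    # Group-relative advantage: normalize rewards across rollouts of one task.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Sequence-level, length-normalized importance ratio, then PPO-style clipping.
    ratio = torch.exp((logp_new - logp_old).mean(dim=-1))
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.min(ratio * adv, clipped * adv).mean()

x0 = torch.randn(4, 7)                # e.g. 4 rollouts of a 7-DoF action
actions, logp = sde_sample(toy_velocity, x0)
rewards = torch.randn(4)              # placeholder task rewards per rollout
loss = gspo_style_loss(logp, logp.detach(), rewards)
print(actions.shape, loss.item())
```

The diffusion term is what makes per-step log-probabilities well defined; a deterministic ODE sampler would give no density for the importance ratio, which is presumably why the SDE reformulation matters for stable online RL.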