OmniVLA-RL: A Vision-Language-Action Model with Spatial Understanding and Online RL

arXiv cs.RO / 4/21/2026


Key Points

  • The paper introduces OmniVLA-RL, a new vision-language-action (VLA) model designed to address known weaknesses of existing embodied-AI systems in spatial perception, multimodal fusion, and reinforcement learning stability.
  • It uses a Mix-of-Transformers (MoT) architecture that combines specialized “reasoning,” “spatial,” and “action” experts to better integrate multimodal information for action selection (a minimal sketch follows this list).
  • To improve action precision and training robustness, the authors propose Flow-GSPO, which reformulates flow matching as an SDE-based sampling process and combines it with Group Segmented Policy Optimization (GSPO); a sketch of this idea follows the abstract below.
  • Experiments on the LIBERO and LIBERO-Plus benchmarks show OmniVLA-RL outperforming prior state-of-the-art approaches, addressing core limitations of existing VLA models.
  • Overall, the work advances the design and training pipeline of VLA systems by coupling spatial understanding improvements with more stable online reinforcement learning.
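
The MoT design described above can be pictured as modality-specific expert feed-forward layers sharing a joint attention layer. The sketch below is a minimal, hypothetical PyTorch rendering of that idea, assuming per-modality token streams and one expert FFN each for “reasoning,” “spatial,” and “action”; the module names, dimensions, and routing scheme are illustrative assumptions, not the paper’s actual code.

```python
# Hypothetical Mix-of-Transformers (MoT) block: shared self-attention fuses
# all modality streams, then each stream is routed through its own expert FFN.
import torch
import torch.nn as nn

class MoTBlock(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # One expert FFN per modality: reasoning, spatial, action (assumed split).
        self.experts = nn.ModuleDict({
            name: nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for name in ("reasoning", "spatial", "action")
        })
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, tokens: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
        # Concatenate modality streams so attention fuses them jointly.
        names = list(tokens)
        lengths = [tokens[n].shape[1] for n in names]
        x = torch.cat([tokens[n] for n in names], dim=1)
        h = self.norm1(x)
        h, _ = self.attn(h, h, h)
        x = x + h
        # Route each stream back through its dedicated expert FFN.
        out, offset = {}, 0
        for n, length in zip(names, lengths):
            chunk = x[:, offset : offset + length]
            out[n] = chunk + self.experts[n](self.norm2(chunk))
            offset += length
        return out

block = MoTBlock()
streams = {n: torch.randn(2, 8, 256) for n in ("reasoning", "spatial", "action")}
fused = block(streams)
print({n: t.shape for n, t in fused.items()})
```

Run as-is, the block fuses three random token streams through shared attention and returns each stream, shape-preserved, after its own expert.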

Abstract

Vision-Language-Action (VLA) models represent a paradigm shift in embodied AI, yet existing frameworks often struggle with imprecise spatial perception, suboptimal multimodal fusion, and instability in reinforcement learning. To bridge these gaps, we propose OmniVLA-RL, a novel architecture that leverages a Mix-of-Transformers (MoT) design to synergistically integrate reasoning, spatial, and action experts. Furthermore, we introduce Flow-GSPO, which reformulates flow matching as a Stochastic Differential Equation (SDE) process and integrates it with Group Segmented Policy Optimization (GSPO) to enhance action precision and training robustness. Extensive evaluations on the LIBERO and LIBERO-Plus benchmarks demonstrate that OmniVLA-RL significantly outperforms state-of-the-art methods, effectively overcoming the fundamental limitations of current VLA models.
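
One way to read the abstract’s Flow-GSPO reformulation: replacing the deterministic flow-matching ODE with an SDE makes each integration step a Gaussian policy with a well-defined log-probability, which can then feed a clipped, group-relative policy-gradient objective. The sketch below illustrates that reading only; `toy_velocity`, the noise schedule, and the GRPO/GSPO-style loss are assumptions made for the example, not the paper’s formulation.

```python
# Hedged sketch: flow matching sampled as an SDE (Euler-Maruyama) plus a
# GSPO-style grouped, clipped policy-gradient loss. Names are illustrative.
import torch

def toy_velocity(x, t):
    # Placeholder drift standing in for the learned velocity field v(x, t).
    return -x

def sde_sample(velocity_net, x0, n_steps=10, sigma=0.1):
    """Integrate dx = v(x, t) dt + sigma dW from t=0 to t=1, logging step log-probs."""
    x, dt = x0, 1.0 / n_steps
    log_probs = []
    for i in range(n_steps):
        t = torch.full(x.shape[:1], i * dt)
        mean = x + velocity_net(x, t) * dt        # deterministic drift step
        std = sigma * dt ** 0.5                   # Euler-Maruyama noise scale
        x_next = mean + std * torch.randn_like(x)
        # Each step is a Gaussian policy; sum log-prob over action dimensions.
        log_probs.append(torch.distributions.Normal(mean, std).log_prob(x_next).sum(-1))
        x = x_next
    return x, torch.stack(log_probs, dim=-1)      # final actions, per-step log-probs

def gspo_style_loss(logp_new, logp_old, rewards, clip_eps=0.2):
    # Group-relative advantage: normalize rewards across rollouts of one task.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Sequence-level, length-normalized importance ratio, then PPO-style clipping.
    ratio = torch.exp((logp_new - logp_old).mean(dim=-1))
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.min(ratio * adv, clipped * adv).mean()

x0 = torch.randn(4, 7)                # e.g. 4 rollouts of a 7-DoF action
actions, logp = sde_sample(toy_velocity, x0)
rewards = torch.randn(4)              # placeholder task rewards per rollout
loss = gspo_style_loss(logp, logp.detach(), rewards)
print(actions.shape, loss.item())
```

The diffusion term is what makes per-step log-probabilities well defined; a deterministic ODE sampler would give no density for the importance ratio, which is presumably why the SDE reformulation matters for stable online RL.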