$M^2$-VLA: Boosting Vision-Language Models for Generalizable Manipulation via Layer Mixture and Meta-Skills

arXiv cs.RO / 4/28/2026


Key Points

  • The paper introduces $M^2$-VLA, arguing that end-to-end fine-tuning in Vision-Language-Action models can weaken VLM generalization and cause catastrophic forgetting.
  • It proposes using a generalized VLM as a direct backbone for robotic manipulation, aiming to connect high-level semantic understanding with robot control needs.
  • To extract task-relevant signals from dense semantic features, the authors develop a Mixture of Layers (MoL) method that selectively emphasizes layer outputs (a rough illustrative sketch follows this list).
  • For efficient trajectory learning under limited model capacity, the Meta Skill Module (MSM) is designed to incorporate strong inductive biases.
  • Experiments in simulation and real-world settings, along with generalization/ablation studies, support the architecture’s effectiveness and zero-shot capabilities, with code and pretrained models planned for public release.
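
The MoL strategy is described only at the level of "selectively emphasizing layer outputs," so the following is a minimal, hypothetical sketch of what such a layer mixture could look like in PyTorch: a learned softmax weighting over all VLM hidden layers, feeding a small projection for a downstream action head. The class and parameter names (LayerMixture, layer_logits) are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of a "Mixture of Layers"-style readout over VLM hidden states.
# This is NOT the paper's exact MoL design; it only illustrates the generic idea of
# learning per-layer weights so a policy head can emphasize task-relevant layers.
import torch
import torch.nn as nn


class LayerMixture(nn.Module):
    """Learned softmax mixture over the hidden states of all VLM layers (assumption)."""

    def __init__(self, num_layers: int, hidden_dim: int):
        super().__init__()
        # One learnable scalar per layer; softmax turns them into mixture weights.
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))
        self.proj = nn.Linear(hidden_dim, hidden_dim)  # hypothetical output projection

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (num_layers, batch, seq_len, hidden_dim), e.g. collected from
        # a frozen VLM run with output_hidden_states=True.
        weights = torch.softmax(self.layer_logits, dim=0)            # (num_layers,)
        mixed = torch.einsum("l,lbsd->bsd", weights, hidden_states)  # weighted sum over layers
        return self.proj(mixed)                                      # features for the action head


if __name__ == "__main__":
    # Dummy tensors standing in for the layer outputs of a 24-layer, 768-dim VLM.
    mol = LayerMixture(num_layers=24, hidden_dim=768)
    dummy = torch.randn(24, 2, 64, 768)  # 24 layers, batch 2, 64 tokens
    print(mol(dummy).shape)              # torch.Size([2, 64, 768])
```

Keeping the VLM backbone frozen and training only such lightweight per-layer weights would be one way to preserve the generalization the paper argues end-to-end fine-tuning destroys, though the paper's actual training recipe is not specified here.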

Abstract

Current Vision-Language-Action (VLA) models predominantly rely on end-to-end fine-tuning. While effective, this paradigm compromises the inherent generalization capabilities of Vision-Language Models (VLMs) and incurs catastrophic forgetting. To address these limitations, we propose $M^2$-VLA, which demonstrates that a generalized VLM can directly serve as a powerful backbone for robotic manipulation. A key challenge, however, is bridging the gap between the high-level semantic understanding of VLMs and the precise requirements of robotic control. To overcome this, we introduce a Mixture of Layers (MoL) strategy that selectively extracts task-critical information from dense semantic features. To facilitate efficient trajectory learning under constrained model capacity, we further propose a Meta Skill Module (MSM) that integrates strong inductive biases. Extensive experiments in both simulated and real-world environments demonstrate the effectiveness of our approach, and generalization and ablation studies validate the architecture's zero-shot capabilities and confirm the contribution of each key component. Our code and pre-trained models will be made publicly available.
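
The abstract describes the MSM only as injecting strong inductive biases for trajectory learning under limited capacity, so the sketch below is speculative rather than the paper's design: a small bank of learned skill embeddings (the "meta skills" here are an assumption) that the fused VLM features attend over before a lightweight decoder emits a short action chunk. All names (MetaSkillHead, num_skills, chunk) are hypothetical.

```python
# Hypothetical sketch of a "Meta Skill Module"-style action head: a learned skill bank
# acts as an inductive bias, and fused VLM features attend over it before a small
# decoder produces an action chunk. The paper's actual MSM design may differ entirely.
import torch
import torch.nn as nn


class MetaSkillHead(nn.Module):
    def __init__(self, feat_dim: int, num_skills: int = 16, action_dim: int = 7, chunk: int = 8):
        super().__init__()
        # Learned skill prototypes (assumption: a shared, reusable skill vocabulary).
        self.skills = nn.Parameter(torch.randn(num_skills, feat_dim) * 0.02)
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        self.decoder = nn.Linear(feat_dim, action_dim * chunk)
        self.action_dim, self.chunk = action_dim, chunk

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, seq_len, feat_dim), e.g. the layer-mixture output above.
        skills = self.skills.unsqueeze(0).expand(features.size(0), -1, -1)
        # Each token queries the skill bank; the pooled result conditions the decoder.
        attended, _ = self.attn(query=features, key=skills, value=skills)
        pooled = attended.mean(dim=1)      # (batch, feat_dim)
        actions = self.decoder(pooled)     # (batch, action_dim * chunk)
        return actions.view(-1, self.chunk, self.action_dim)


if __name__ == "__main__":
    head = MetaSkillHead(feat_dim=768)
    feats = torch.randn(2, 64, 768)
    print(head(feats).shape)  # torch.Size([2, 8, 7])
```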