LaMP: Learning Vision-Language-Action Policies with 3D Scene Flow as Latent Motion Prior

arXiv cs.RO / 3/27/2026



Key Points

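LaMP is a Vision-Language-Action (VLA) framework for robotic manipulation. Conventional methods regress actions directly from 2D features and must therefore learn 3D physics implicitly; LaMP eases that burden by embedding 3D scene flow as a latent motion prior.
The Motion Expert generates a one-step, partially denoised 3D scene flow, and its hidden states condition the Action Expert through gated cross-attention, so action prediction proceeds without full multi-step reconstruction.
Across the LIBERO, LIBERO-Plus, and SimplerEnv-WidowX simulation benchmarks and in real-world experiments, LaMP is reported to consistently outperform existing VLA baselines and to achieve the highest average success rate under the same training budget.
Under LIBERO-Plus OOD (out-of-distribution) perturbations, LaMP shows an average 9.7% improvement over the strongest prior baseline, suggesting stronger robustness to unseen spatial dynamics.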
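
To make "one-step, partially denoised" concrete, here is a minimal sketch of a single Euler step in a flow-matching sampler. It is an illustration under stated assumptions, not the paper's implementation: the linear noise-to-data interpolation, the step size of 0.5, and the names `one_step_partial_denoise` and `oracle_velocity` are all hypothetical, and the oracle velocity stands in for a learned network.

```python
def one_step_partial_denoise(noise, velocity_fn, step=0.5):
    """Take a single Euler step of size `step` starting from pure noise (t = 0).

    With step < 1 the result is only partially denoised; no further
    integration steps are run. The step size is an assumed value.
    """
    velocity = velocity_fn(noise, 0.0)
    return [x + step * v for x, v in zip(noise, velocity)]


# Hypothetical stand-in for a learned network: the exact straight-line
# velocity toward a known target flow. A real model would predict this.
target_flow = [1.0, -2.0, 0.5]  # toy per-point displacement (dx, dy, dz)


def oracle_velocity(x, t):
    # Assumes the linear path x_t = (1 - t) * noise + t * target.
    return [(g - xi) / (1.0 - t) for g, xi in zip(target_flow, x)]


noise = [0.0, 0.0, 0.0]
partial = one_step_partial_denoise(noise, oracle_velocity, step=0.5)
print(partial)  # [0.5, -1.0, 0.25]
```

In this toy setup the single step lands halfway between the noise and the target flow. According to the summary above, LaMP passes the Motion Expert's hidden states at this stage to the Action Expert rather than a fully reconstructed flow.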

Abstract

We introduce LaMP, a dual-expert Vision-Language-Action framework that embeds dense 3D scene flow as a latent motion prior for robotic manipulation. Existing VLA models regress actions directly from 2D semantic visual features, forcing them to learn complex 3D physical interactions implicitly. This implicit learning strategy degrades under unfamiliar spatial dynamics. LaMP addresses this limitation by aligning a flow-matching Motion Expert with a policy-predicting Action Expert through gated cross-attention. Specifically, the Motion Expert generates a one-step partially denoised 3D scene flow, and its hidden states condition the Action Expert without full multi-step reconstruction. We evaluate LaMP on the LIBERO, LIBERO-Plus, and SimplerEnv-WidowX simulation benchmarks as well as real-world experiments. LaMP consistently outperforms evaluated VLA baselines across LIBERO, LIBERO-Plus, and SimplerEnv-WidowX benchmarks, achieving the highest reported average success rates under the same training budgets. On LIBERO-Plus OOD perturbations, LaMP shows improved robustness with an average 9.7% gain over the strongest prior baseline. Our project page is available at https://summerwxk.github.io/lamp-project-page/.
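
The gated cross-attention that links the two experts can be sketched roughly as follows. This is a simplified illustration, not the authors' architecture. It assumes a single attention head, identity query/key/value projections in place of learned ones, and a scalar tanh gate in the style of earlier gated cross-attention designs. The abstract does not specify the gate's form, and the function name `gated_cross_attention` is hypothetical.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gated_cross_attention(action_tokens, motion_hidden, gate):
    """Condition action tokens on motion hidden states.

    action_tokens: list of query vectors (Action Expert side).
    motion_hidden: list of key/value vectors (Motion Expert hidden states).
    gate: scalar; tanh(gate) scales how much motion context is added.
          (Assumed gate form; learned projections are omitted for brevity.)
    """
    dim = len(action_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    g = math.tanh(gate)
    conditioned = []
    for q in action_tokens:
        # Attention weights of this action token over all motion states.
        weights = softmax([dot(q, k) * scale for k in motion_hidden])
        attended = [
            sum(w * v[i] for w, v in zip(weights, motion_hidden))
            for i in range(dim)
        ]
        # Gated residual: gate = 0 leaves the action token unchanged.
        conditioned.append([qi + g * ai for qi, ai in zip(q, attended)])
    return conditioned


actions = [[1.0, 0.0], [0.0, 1.0]]
motion = [[0.5, 0.5], [2.0, -1.0], [0.0, 3.0]]
print(gated_cross_attention(actions, motion, gate=0.0) == actions)  # True
```

With the gate at zero, the action tokens pass through unchanged, so in a design like this the motion prior can be blended in gradually as the gate moves away from zero.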