Move-Then-Operate: Behavioral Phasing for Human-Like Robotic Manipulation

arXiv cs.RO / 4/28/2026


Key Points

  • The paper introduces “Move-Then-Operate,” a vision-language-action (VLA) framework that splits robotic manipulation into two behavioral phases: coarse relocation (“move”) and contact-critical interaction (“operate”).
  • Instead of a single monolithic policy, it uses a dual-expert policy routed by a learnable phase selector, a structural inductive bias that isolates phase-specific dynamics.
  • Phase labels are generated automatically by an MLLM-based pipeline conditioned on lightweight contextual cues, such as end-effector velocity and subtask decomposition, so that the labels align with human motor patterns.
  • On the RoboTwin2 benchmark, the method reaches a 68.9% average success rate, outperforming the monolithic π0 baseline by 24%. It matches or exceeds models trained on 10× more data and reaches peak performance in 40% fewer training steps.
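
To make the end-effector velocity cue concrete, here is a minimal Python sketch. It is not the paper's MLLM-based labeling pipeline: it only shows how a speed signal could be derived from a trajectory and turned into a coarse move/operate hint. The function names, the 0.1 s time step, and the 0.05 speed threshold are all hypothetical choices made for illustration.

```python
import math
from typing import List, Sequence

def end_effector_speeds(positions: Sequence[Sequence[float]], dt: float) -> List[float]:
    """Finite-difference speed between consecutive end-effector positions."""
    return [math.dist(prev, curr) / dt for prev, curr in zip(positions, positions[1:])]

def velocity_cue(
    positions: Sequence[Sequence[float]],
    dt: float = 0.1,                 # assumed sampling interval in seconds
    speed_threshold: float = 0.05,   # assumed cutoff, not a value from the paper
) -> List[str]:
    """Tag each transition 'move' when fast and 'operate' when slow.

    This threshold rule is an illustrative stand-in for a contextual cue;
    in the paper the final phase labels come from an MLLM-based pipeline.
    """
    return [
        "move" if speed > speed_threshold else "operate"
        for speed in end_effector_speeds(positions, dt)
    ]

# Toy trajectory: two fast approach steps, then two slow fine adjustments.
trajectory = [
    (0.0, 0.0, 0.0),
    (0.1, 0.0, 0.0),
    (0.2, 0.0, 0.0),
    (0.201, 0.0, 0.0),
    (0.202, 0.0, 0.0),
]
cues = velocity_cue(trajectory)
print(cues)
```

A cue like this is cheap to compute per timestep, which is presumably why the summary calls such signals lightweight; on its own it cannot distinguish a slow transport motion from a contact-critical one, which is where subtask context would have to come in.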

Abstract

We present Move-Then-Operate, a vision-language-action (VLA) framework that explicitly decouples robotic manipulation into two distinct behavioral phases: coarse relocation (move) and contact-critical interaction (operate). Unlike monolithic policies that conflate these heterogeneous regimes, our architecture employs a dual-expert policy routed by a learnable phase selector, introducing a structural inductive bias that isolates phase-specific dynamics. Phase labels are automatically generated via an MLLM-based pipeline conditioned on lightweight contextual cues, such as end-effector velocity and subtask decomposition, to ensure alignment with human motor patterns. Evaluated on the RoboTwin2 benchmark, our method achieves an average success rate of 68.9%, outperforming the monolithic π0 baseline by 24%. It matches or exceeds models trained on 10× more data and reaches peak performance in 40% fewer training steps, demonstrating that architectural disentanglement of move and operate phases is a highly effective and efficient strategy for mastering high-precision manipulation.
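
The routing idea in the abstract can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the authors' architecture: the selector here is a logistic gate with fixed, hand-picked weights (in the paper it is learned), the two experts are simple step rules rather than neural policies, and hard routing at a 0.5 threshold is an assumption, since the abstract does not say whether routing is hard or soft. All class and function names are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

Action = List[float]
Expert = Callable[[Sequence[float]], Action]

class PhaseSelector:
    """Logistic gate returning P(phase == 'operate') for a feature vector."""

    def __init__(self, weights: Sequence[float], bias: float = 0.0) -> None:
        self.weights = list(weights)
        self.bias = bias

    def operate_probability(self, features: Sequence[float]) -> float:
        logit = self.bias + sum(w * x for w, x in zip(self.weights, features))
        return 1.0 / (1.0 + math.exp(-logit))

class DualExpertPolicy:
    """Routes each observation to exactly one of two phase-specific experts."""

    def __init__(self, selector: PhaseSelector, move_expert: Expert, operate_expert: Expert) -> None:
        self.selector = selector
        self.experts = {"move": move_expert, "operate": operate_expert}

    def act(self, features: Sequence[float]) -> Tuple[str, Action]:
        # Hard routing (an assumption): pick the phase, then query only that expert.
        p_operate = self.selector.operate_probability(features)
        phase = "operate" if p_operate >= 0.5 else "move"
        return phase, self.experts[phase](features)

# Toy setup: the only feature is the distance to the target.
def move_expert(features: Sequence[float]) -> Action:
    return [min(features[0], 0.1)]    # coarse relocation: large steps

def operate_expert(features: Sequence[float]) -> Action:
    return [min(features[0], 0.01)]   # contact-critical interaction: small steps

# Hand-picked weights so that small distances favor the 'operate' phase.
policy = DualExpertPolicy(PhaseSelector(weights=[-10.0], bias=1.0), move_expert, operate_expert)

far = policy.act([0.5])    # far from the target
near = policy.act([0.02])  # close to the target
print(far, near)
```

Because each expert only ever sees observations from its own phase, neither has to fit both the fast, coarse regime and the slow, precise one, which is the intuition behind the structural inductive bias the abstract describes.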