AnchorRefine: Synergy-Manipulation Based on Trajectory Anchor and Residual Refinement for Vision-Language-Action Models

arXiv cs.RO / 4/21/2026

Key Points

  • AnchorRefine targets a limitation of many vision-language-action (VLA) policies: they optimize global motion and local corrections under a single objective, so large motions dominate learning and small but failure-critical corrections are suppressed.
  • The proposed hierarchical framework factorizes action modeling into a trajectory anchor planner for coarse motion scaffolding and a residual refinement module for execution-level geometric and contact corrections.
  • It also adds a decision-aware gripper refinement mechanism to better handle discrete, boundary-sensitive gripper control.
  • Experiments on LIBERO, CALVIN, and real-robot tasks show consistent improvements across both regression-based and diffusion-based VLA backbones, with gains of up to 7.8% in simulation success rate and 18% in real-world success rate.
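
The anchor/residual factorization in the points above can be pictured with a toy example. This is a minimal sketch and not the paper's implementation: it assumes a 1-D trajectory, builds the "anchor" by keeping every few waypoints and linearly interpolating between them, and treats whatever is left over as the "residual". The function names `make_anchor` and `split_anchor_residual`, the stride, and the sample values are all hypothetical. In AnchorRefine itself both parts are predicted by learned modules; the sketch only shows how a trajectory splits into a coarse scaffold plus small corrections.

```python
def make_anchor(traj, stride):
    """Coarse scaffold: keep every `stride`-th waypoint (and the last one),
    then linearly interpolate between the kept waypoints."""
    n = len(traj)
    if n == 0:
        return []
    anchor = list(traj)
    keys = list(range(0, n, stride))
    if keys[-1] != n - 1:
        keys.append(n - 1)
    for a, b in zip(keys, keys[1:]):
        for t in range(a, b + 1):
            w = (t - a) / (b - a)
            anchor[t] = (1 - w) * traj[a] + w * traj[b]
    return anchor


def split_anchor_residual(traj, stride):
    """Return (anchor, residual) such that anchor[t] + residual[t] ~= traj[t]."""
    anchor = make_anchor(traj, stride)
    residual = [x - a for x, a in zip(traj, anchor)]
    return anchor, residual


# Made-up end-effector positions along one axis (metres).
traj = [0.00, 0.10, 0.21, 0.30, 0.40, 0.52, 0.60, 0.61, 0.60]
anchor, residual = split_anchor_residual(traj, stride=4)

# Adding the residual back onto the anchor recovers the original trajectory
# (up to floating-point rounding).
reconstructed = [a + r for a, r in zip(anchor, residual)]

print("anchor:  ", [round(a, 3) for a in anchor])
print("residual:", [round(r, 3) for r in residual])
```

In this toy data the residual values are much smaller than the trajectory values, which is the property a separate refinement stage is meant to exploit.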

Abstract

Precision-critical manipulation requires both global trajectory organization and local execution correction, yet most vision-language-action (VLA) policies generate actions within a single unified space. This monolithic formulation forces macro-level transport and micro-level refinement to be optimized under the same objective, causing large motions to dominate learning while suppressing small but failure-critical corrective signals. In contrast, human manipulation is structured by global movement planning together with continuous local adjustment during execution. Motivated by this principle, we propose AnchorRefine, a hierarchical framework that factorizes VLA action modeling into trajectory anchor and residual refinement. The anchor planner predicts a coarse motion scaffold, while the refinement module corrects execution-level deviations to improve geometric and contact precision. We further introduce a decision-aware gripper refinement mechanism to better capture the discrete and boundary-sensitive nature of gripper control. Experiments on LIBERO, CALVIN, and real-robot tasks demonstrate that AnchorRefine consistently improves both regression-based and diffusion-based VLA backbones, yielding gains of up to 7.8% in simulation success rate and 18% in real-world success rate.
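
The claim that large motions dominate learning can be made concrete with a little arithmetic. The sketch below assumes a plain squared-error objective on a single action dimension; the abstract does not state which loss the backbones use, and the numbers are invented for illustration. It compares the loss from a 10% error on a large transport motion with the loss from ignoring a small corrective offset entirely.

```python
def squared_error(pred, target):
    return (pred - target) ** 2


# Made-up magnitudes (metres): a large transport motion and a small correction.
transport = 0.30
correction = 0.004
target = transport + correction

# Case A: transport is off by 10%, correction is predicted perfectly.
loss_transport_error = squared_error(0.9 * transport + correction, target)

# Case B: transport is perfect, correction is ignored completely.
loss_missing_correction = squared_error(transport, target)

# How much more the objective "cares" about case A than case B.
ratio = loss_transport_error / loss_missing_correction

# Assumption for illustration: if the residual is modeled separately and
# rescaled by its own typical magnitude, ignoring it costs an O(1) loss.
residual_scale = correction
loss_missing_correction_rescaled = squared_error(0.0, correction / residual_scale)

print(f"unified objective, transport error:     {loss_transport_error:.2e}")
print(f"unified objective, missing correction:  {loss_missing_correction:.2e}")
print(f"ratio: {ratio:.1f}")
print(f"rescaled residual, missing correction:  {loss_missing_correction_rescaled:.2f}")
```

Under these assumed numbers, dropping the correction costs well over an order of magnitude less than a modest transport error, so a single objective gives the optimizer little incentive to learn it. Rescaling the residual is only one possible way to restore that signal and is not claimed to be the mechanism AnchorRefine uses.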