Boosting Vision-Language-Action Finetuning with Feasible Action Neighborhood Prior

arXiv cs.RO / 4/3/2026


Key Points

  • The paper argues that real-world robotic manipulation naturally admits a feasible action neighborhood (FAN) in which multiple actions yield effectively similar progress, rather than a single correct action.
  • It introduces a FAN-guided regularizer for vision-language-action (VLA) fine-tuning that reshapes the model’s output distribution using a Gaussian prior to encourage locally smooth, unimodal predictions near the preferred direction and magnitude.
  • Experiments show the method improves sample efficiency and success rate in both reinforced finetuning (RFT) and supervised finetuning (SFT).
  • Results are reported as strong not only in-distribution but also out-of-distribution (OOD), suggesting better generalization for VLA adaptation.
  • The approach is presented as a principled way to match model behavior to the physical manipulation tolerances inherent in robotics, improving both practicality and learning efficiency.
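To make the mechanism concrete, here is a minimal sketch of what a FAN-guided Gaussian regularizer could look like for a VLA model that discretizes each action dimension into bins (a common VLA design). The bin count, `sigma`, and function names are illustrative assumptions, not the paper's exact formulation: the idea is to replace the one-hot training target with a Gaussian-smoothed target centered on the demonstrated action, so the output distribution is encouraged to be smooth and unimodal around it.

```python
import numpy as np

def gaussian_soft_target(num_bins: int, target_bin: int, sigma: float) -> np.ndarray:
    """Gaussian prior over discretized action bins, centered on the
    demonstrated action bin (illustrative sketch)."""
    bins = np.arange(num_bins)
    weights = np.exp(-0.5 * ((bins - target_bin) / sigma) ** 2)
    return weights / weights.sum()

def fan_regularized_loss(logits: np.ndarray, target_bin: int, sigma: float = 1.5) -> float:
    """Cross-entropy against the Gaussian-smoothed target instead of a
    one-hot label, pushing probability mass toward a smooth, unimodal
    shape around the preferred action bin."""
    soft_target = gaussian_soft_target(logits.shape[-1], target_bin, sigma)
    shifted = logits - logits.max()                       # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())   # log-softmax
    return float(-(soft_target * log_probs).sum())

# Uniform logits give loss log(num_bins) ≈ 2.398 for 11 bins,
# since the soft target sums to 1.
loss = fan_regularized_loss(np.zeros(11), target_bin=5)
```

Compared with a one-hot cross-entropy, this loss no longer penalizes the model for placing mass on near-equivalent neighboring actions, which is one plausible way to encode the FAN tolerance described above.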

Abstract

In real-world robotic manipulation, states typically admit a neighborhood of near-equivalent actions. That is, for each state there exists a feasible action neighborhood (FAN) rather than a single correct action, within which motions yield indistinguishable progress. However, prevalent VLA training methodologies are directly inherited from linguistic settings and do not exploit the FAN property, leading to poor generalization and low sample efficiency. To address this limitation, we introduce a FAN-guided regularizer that shapes the model's output distribution to align with the geometry of the FAN. Concretely, we introduce a Gaussian prior that promotes locally smooth and unimodal predictions around the preferred direction and magnitude. In extensive experiments across both reinforced finetuning (RFT) and supervised finetuning (SFT), our method achieves significant improvements in sample efficiency and success rate in both in-distribution and out-of-distribution (OOD) scenarios. By aligning with the intrinsic action tolerance of physical manipulation, FAN-guided regularization provides a principled and practical method for sample-efficient and generalizable VLA adaptation.