KERV: Kinematic-Rectified Speculative Decoding for Embodied VLA Models

arXiv cs.RO / April 28, 2026


Key Points

  • The paper proposes KERV, a kinematic-rectified speculative decoding framework that combines token-domain Vision-Language-Action (VLA) models with kinematics-domain prediction to improve inference speed.
  • It uses a kinematics-based Kalman Filter to predict actions and compensate for speculative decoding token errors, aiming to avoid expensive re-inference.
  • It introduces a kinematics-based strategy to dynamically adjust the speculative decoding acceptance threshold, reducing the need for careful manual tuning.
  • Experiments across multiple tasks and environments show KERV delivers about 27%–37% acceleration with nearly no loss in Success Rate.
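To make the prediction step above concrete, here is a minimal sketch of a constant-velocity Kalman filter that could stand in for KERV's kinematics-based predictor. The paper's actual state model and noise parameters are not specified here, so the state layout (position and velocity per action dimension) and all noise values are illustrative assumptions:

```python
import numpy as np

class KinematicKF:
    """Constant-velocity Kalman filter over robot action dimensions.

    Tracks [positions, velocities]; the predict() output can be used to
    compensate for a rejected speculative-decoding token instead of
    re-running the target model (the mechanism the paper describes;
    this particular formulation is an assumed sketch).
    """
    def __init__(self, dim, dt=0.1, q=1e-3, r=1e-2):
        self.dim = dim
        self.x = np.zeros(2 * dim)          # state: [pos_1..pos_d, vel_1..vel_d]
        self.P = np.eye(2 * dim)            # state covariance
        self.F = np.eye(2 * dim)            # transition: pos += vel * dt
        self.F[:dim, dim:] = dt * np.eye(dim)
        self.H = np.hstack([np.eye(dim), np.zeros((dim, dim))])  # observe positions
        self.Q = q * np.eye(2 * dim)        # process noise (assumed value)
        self.R = r * np.eye(dim)            # measurement noise (assumed value)

    def predict(self):
        """Advance the state one step; return the predicted action (positions)."""
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.H @ self.x

    def update(self, z):
        """Correct the state with an observed/accepted action z."""
        y = z - self.H @ self.x                            # innovation
        S = self.H @ self.P @ self.H.T + self.R            # innovation covariance
        K = self.P @ self.H.T @ np.linalg.inv(S)           # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(len(self.x)) - K @ self.H) @ self.P
```

Because predicting from the filter is a pair of small matrix multiplies, substituting its output for a rejected token is far cheaper than re-invoking the VLA target model, which is the source of the claimed speedup.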

Abstract

Vision-Language-Action (VLA) models establish a token-domain paradigm for robot control, yet suffer from low inference speed. Speculative Decoding (SD) is an optimization strategy that can boost inference speed, but two key issues emerge when integrating VLA and SD: first, SD relies on re-inference to correct token errors, which is computationally expensive; second, mitigating token errors requires careful tuning of the SD acceptance threshold. Existing works fail to address these two issues effectively. Meanwhile, although embodied intelligence serves as the bridge between AI and the physical world, existing work has overlooked the application of robotic kinematics. To address these issues, we innovatively combine token-domain VLA models with kinematics-domain prediction for SD, proposing a kinematic-rectified SD framework named KERV. We employ a kinematics-based Kalman Filter to predict actions and compensate for SD errors, avoiding costly re-inference. Moreover, we design a kinematics-based adjustment strategy to dynamically rectify the acceptance threshold, addressing the difficulty of threshold determination. Experimental results across diverse tasks and environments demonstrate that KERV achieves 27%–37% acceleration with nearly no Success Rate loss.
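The second mechanism, dynamically rectifying the acceptance threshold, can be illustrated with a small sketch. The rule below is an assumption for illustration (the paper does not publish its exact formula here): when the drafted action agrees with the kinematic prediction, the threshold is relaxed so more drafts are accepted; when it disagrees, the nominal threshold applies. The function name and all hyperparameters (`base_tau`, `alpha`, `max_err`) are hypothetical:

```python
def accept_draft(draft_prob, kinematic_error, base_tau=0.5, alpha=0.3, max_err=1.0):
    """Kinematics-aware acceptance test for a speculatively drafted action token.

    draft_prob      : target-model probability assigned to the drafted token
    kinematic_error : distance between the drafted action and the
                      Kalman-filter-predicted action
    base_tau        : nominal acceptance threshold (assumed hyperparameter)
    alpha           : strength of the kinematic relaxation (assumed)
    max_err         : error beyond which no relaxation is applied (assumed)
    """
    # Map the error to an agreement score in [0, 1]; small error -> high agreement.
    agreement = max(0.0, 1.0 - kinematic_error / max_err)
    # Relax the threshold in proportion to kinematic agreement.
    tau = base_tau * (1.0 - alpha * agreement)
    return draft_prob >= tau
```

A marginal draft (`draft_prob=0.4`) would be rejected under the nominal threshold of 0.5, but accepted when its kinematic error is near zero, which is how a kinematics signal can reduce rejections, and hence re-inference, without manual threshold tuning.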