HeiSD: Hybrid Speculative Decoding for Embodied Vision-Language-Action Models with Kinematic Awareness

arXiv cs.RO / 4/28/2026


Key Points

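  • Vision-Language-Action (VLA) models are becoming the mainstream approach to robot control, but they suffer from slow inference, and Speculative Decoding (SD) is drawing attention as a way to accelerate them.
  • SD comes in two families, drafter-based and retrieval-based. Because their strengths and weaknesses are complementary, the paper hypothesizes that a hybrid combining the two would be effective.
  • Implementing hybrid SD for VLA models raises challenges, however: draft rejection and persistent errors on the retrieval side, and difficulty in determining the hybrid boundary.
  • To address these, HeiSD proposes a retrieval-based SD optimization built on a verify-skip mechanism and a sequence-wise relaxed acceptance strategy, together with a kinematic-based fused metric that automatically determines the hybrid boundary.
  • In experiments, HeiSD is reported to achieve a speedup of up to 2.45x in simulation and 2.06x to 2.41x in real-world settings while maintaining a high task success rate.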

Abstract

Vision-Language-Action (VLA) models have become the mainstream solution for robot control, but they suffer from slow inference. Speculative Decoding (SD) is a promising acceleration method whose variants fall into two categories: drafter-based SD and retrieval-based SD. The two show complementary advantages and limitations when applied to VLA models, which leads to the hypothesis that a hybrid approach integrating them will yield better performance. In this paper, we first conduct a series of detailed analyses to reveal the advantages and feasibility of hybrid utilization. Even with these insights, implementing hybrid SD in VLA models presents two challenges: (1) draft rejection and persistent errors in retrieval-based SD; (2) difficulty in determining the hybrid boundary. To address these, we propose the HeiSD framework. HeiSD includes a retrieval-based SD optimization method, which consists of a verify-skip mechanism and a sequence-wise relaxed acceptance strategy. It also includes a kinematic-based fused metric that automatically determines the hybrid boundary. Experimental results demonstrate that HeiSD attains a speedup of up to 2.45x in simulation benchmarks and 2.06x to 2.41x in real-world scenarios, while sustaining a high task success rate.
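
To make the draft-and-verify idea concrete, the following is a minimal sketch of one greedy speculative decoding step with strict and relaxed acceptance. It is not HeiSD's implementation: the abstract does not specify the acceptance rule, so the criterion below is an assumption chosen only for illustration. It assumes action tokens are integer bin indices, so nearby indices decode to nearby actions. It accepts a draft prefix while each token deviates from the verifier by at most `tol` bins and the summed deviation stays within `budget`. The function names and both parameters are hypothetical.

```python
from typing import List, Sequence


def strict_accept_len(draft: Sequence[int], target: Sequence[int]) -> int:
    """Length of the longest draft prefix that exactly matches the verifier."""
    n = 0
    for d, t in zip(draft, target):
        if d != t:
            break
        n += 1
    return n


def relaxed_accept_len(draft: Sequence[int], target: Sequence[int],
                       tol: int = 1, budget: int = 2) -> int:
    """Hypothetical sequence-wise relaxed rule (an assumption, not the paper's).

    Accept a prefix while each token is within `tol` bins of the verifier's
    token and the cumulative deviation over the prefix is at most `budget`.
    """
    total = 0
    n = 0
    for d, t in zip(draft, target):
        dev = abs(d - t)
        if dev > tol or total + dev > budget:
            break
        total += dev
        n += 1
    return n


def speculative_step(draft: Sequence[int], target: Sequence[int],
                     relaxed: bool = False) -> List[int]:
    """Emit the accepted draft tokens, then one token from the verifier.

    `target` holds the verifier's greedy token at each draft position, plus
    an optional extra token that is used when the whole draft is accepted.
    """
    if relaxed:
        n = relaxed_accept_len(draft, target)
    else:
        n = strict_accept_len(draft, target)
    out = list(draft[:n])
    if n < len(target):
        out.append(target[n])  # verifier's token at the first rejected slot
    return out


if __name__ == "__main__":
    draft = [120, 131, 90, 77]        # tokens from a drafter or a retrieval store
    target = [120, 130, 90, 40, 12]   # verifier output from one forward pass
    print(speculative_step(draft, target))                # strict matching
    print(speculative_step(draft, target, relaxed=True))  # relaxed matching
```

With these inputs, strict matching stops at the second token, while the relaxed rule tolerates the one-bin difference and accepts three draft tokens before the verifier takes over. Longer accepted prefixes mean fewer verifier passes per action, which is where a speedup would come from. The trade-off is that the emitted tokens may differ slightly from what the verifier alone would produce.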