ActDistill: General Action-Guided Self-Derived Distillation for Efficient Vision-Language-Action Models

arXiv cs.RO / 4/7/2026


Key Points

  • ActDistill is proposed as a general “action-guided self-derived distillation” method to compress Vision-Language-Action (VLA) models into lightweight students for faster robotic inference.
  • The approach uses a well-trained VLA model as a teacher and introduces a graph-structured encapsulation to model the hierarchical evolution of action prediction, then trains a student derived from that encapsulated teacher.
  • A dynamic router is added to the student to adaptively select computation paths at inference time based on action-prediction demands, supervised with hierarchical, graph-informed signals.
  • During inference, graph-related auxiliary components are removed so the student can run only the dynamically routed layers, targeting both reduced compute and lower latency.
  • Experiments on embodied benchmarks reportedly show comparable or better performance than full-scale VLA models while cutting computation by over 50% and achieving up to 1.67× speedup.
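The dynamic-router idea in the points above can be illustrated with a toy sketch: a stack of residual layers where a small gating function scores the current hidden state and skips layers whose gate falls below a threshold, so only the routed layers execute. This is a minimal illustrative sketch in numpy; the class, layer form, and sigmoid gating are assumptions for exposition, not ActDistill's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

class RoutedStudent:
    """Toy student with a per-layer dynamic router (illustrative only).

    Each "layer" is a small residual map; a per-layer gate vector scores
    the incoming hidden state, and layers whose sigmoid gate falls below
    the threshold are skipped, mimicking the idea of executing only
    dynamically routed layers at inference time.
    """

    def __init__(self, dim=8, n_layers=6, threshold=0.5):
        self.layers = [rng.standard_normal((dim, dim)) * 0.1
                       for _ in range(n_layers)]
        self.gates = [rng.standard_normal(dim) for _ in range(n_layers)]
        self.threshold = threshold

    def forward(self, x):
        executed = []
        for i, (W, g) in enumerate(zip(self.layers, self.gates)):
            score = 1.0 / (1.0 + np.exp(-float(x @ g)))  # sigmoid gate
            if score >= self.threshold:
                x = x + np.tanh(W @ x)  # residual layer update
                executed.append(i)
        return x, executed

model = RoutedStudent()
out, path = model.forward(rng.standard_normal(8))
```

In a trained model the gate parameters would be learned under the hierarchical supervision described above; here `path` simply records which layers ran, showing how compute scales with the routed subset rather than the full depth.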

Abstract

Recent Vision-Language-Action (VLA) models have shown impressive flexibility and generalization, yet their deployment in robotic manipulation remains limited by heavy computational overhead and inference latency. In this work, we present ActDistill, a general action-guided self-derived distillation framework that transfers the action prediction capability of any existing VLA model to a lightweight counterpart. Unlike previous efficiency strategies that primarily emphasize vision-language correlations, ActDistill leverages action priors to guide knowledge transfer and model compression, achieving action-oriented efficiency for VLA models. Specifically, we employ a well-trained VLA model as the teacher and introduce a graph-structured encapsulation strategy to explicitly model the hierarchical evolution of action prediction. The student model, derived from the graph-encapsulated teacher, is further equipped with a dynamic router that adaptively selects computation paths based on action prediction demands, guided by hierarchical graph-informed supervision to ensure smooth and efficient evolution. During inference, graph-related auxiliary components are removed, allowing the student to execute only dynamically routed layers and predict high-precision actions with minimal computation and latency. Experiments on embodied benchmarks demonstrate that ActDistill achieves comparable or superior performance to full-scale VLA models while reducing computation by over 50% with up to 1.67 times speedup, thereby establishing a general paradigm toward efficient embodied intelligence.
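The "hierarchical graph-informed supervision" the abstract describes amounts to supervising the student on both the teacher's final action predictions and matched intermediate states. A minimal sketch of such a combined distillation objective, assuming a KL term on action distributions plus an MSE term on paired hidden states (the function names, weighting, and loss forms are illustrative assumptions, not the paper's definitions):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def distill_loss(teacher_logits, student_logits,
                 teacher_hidden, student_hidden, alpha=0.5):
    """Toy distillation objective: KL on action logits plus MSE on
    matched intermediate hidden states (hierarchical supervision)."""
    p, q = softmax(teacher_logits), softmax(student_logits)
    # KL(teacher || student) on the action distribution
    kl = float(np.sum(p * (np.log(p + 1e-9) - np.log(q + 1e-9))))
    # mean MSE over each matched teacher/student intermediate state
    mse = float(np.mean([np.mean((t - s) ** 2)
                         for t, s in zip(teacher_hidden, student_hidden)]))
    return alpha * kl + (1 - alpha) * mse

rng = np.random.default_rng(1)
t_logits = rng.standard_normal(7)
s_logits = rng.standard_normal(7)
t_hidden = [rng.standard_normal(8) for _ in range(3)]
loss = distill_loss(t_logits, s_logits, t_hidden, t_hidden)
```

A perfectly matched student drives both terms to zero, which is the training signal: the student is pushed to reproduce the teacher's action evolution at every supervised level, after which the auxiliary graph components can be discarded at inference.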
