2D or 3D: Who Governs Salience in VLA Models? -- Tri-Stage Token Pruning Framework with Modality Salience Awareness

arXiv cs.RO / 4/13/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper addresses the challenge that Vision-Language-Action models expanding from 2D-only to 2D+3D inputs (multi-visual-modal VLA, or MVLA) generate more tokens, increasing the computational demand on embodied intelligence systems.
  • It argues that existing token pruning methods are not well-suited for MVLA because they ignore differing salience between 2D and 3D modalities.
  • The authors introduce a tri-stage analysis to model the discrepancy and dynamics of 2D/3D modality salience, then use it to build a tri-stage token pruning framework tailored to MVLA.
  • Experiments report up to a 2.55x inference speedup with minimal accuracy loss, at an added overhead of 5.8%.
  • The authors note the code will be released soon, indicating the method may be practically deployable after publication.

Abstract

Vision-Language-Action (VLA) models have emerged as the mainstream of embodied intelligence. Recent VLA models have expanded their input modalities from 2D-only to 2D+3D paradigms, forming multi-visual-modal VLA (MVLA) models. Despite achieving improved spatial perception, MVLA faces a greater acceleration demand due to the increased number of input tokens caused by modal expansion. Token pruning is an effective optimization technique for accelerating MVLA models. However, existing token pruning schemes are designed for 2D-only VLA models, ignoring 2D/3D modality salience differences. In this paper, we follow the application process of multi-modal data in MVLA models and develop a tri-stage analysis to capture the discrepancy and dynamics of 2D/3D modality salience. Based on this analysis, we propose a corresponding tri-stage token pruning framework for MVLA models to achieve optimal 2D/3D token selection and efficient pruning. Experiments show that our framework achieves up to a 2.55x inference speedup with minimal accuracy loss, while incurring only 5.8% overhead. Our code is coming soon.
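To make the core idea concrete, here is a minimal sketch of modality-salience-aware token pruning. This is an illustrative reconstruction, not the paper's actual tri-stage algorithm (which is not detailed in the abstract): it assumes each token already has a scalar salience score (e.g., derived from attention weights) and splits a shared keep-budget between the 2D and 3D modalities in proportion to their aggregate salience, rather than pruning both modalities with one uniform rule.

```python
import numpy as np

def modality_aware_prune(tokens_2d, tokens_3d, sal_2d, sal_3d, keep_ratio=0.5):
    """Hypothetical sketch of salience-aware 2D/3D token pruning.

    tokens_2d, tokens_3d: (N, D) token embeddings per modality.
    sal_2d, sal_3d: (N,) non-negative per-token salience scores.
    keep_ratio: fraction of all tokens to retain overall.
    """
    total = len(sal_2d) + len(sal_3d)
    budget = int(round(total * keep_ratio))

    # Split the global budget by each modality's share of total salience,
    # so the more salient modality keeps more of its tokens.
    mass_2d, mass_3d = float(sal_2d.sum()), float(sal_3d.sum())
    k2d = int(round(budget * mass_2d / (mass_2d + mass_3d)))
    k2d = min(max(k2d, 0), len(sal_2d))
    k3d = min(budget - k2d, len(sal_3d))

    # Within each modality, keep the highest-salience tokens.
    idx2d = np.argsort(sal_2d)[::-1][:k2d]
    idx3d = np.argsort(sal_3d)[::-1][:k3d]
    return tokens_2d[idx2d], tokens_3d[idx3d]
```

Under equal salience mass the budget splits evenly; when one modality's tokens dominate the salience mass, that modality retains proportionally more tokens, which is the behaviour a uniform 2D-only pruning rule cannot provide.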