2D or 3D: Who Governs Salience in VLA Models? -- Tri-Stage Token Pruning Framework with Modality Salience Awareness
arXiv cs.RO / 4/13/2026
Key Points
- The paper addresses the challenge that Vision-Language-Action models, as they move from 2D-only inputs to combined 2D+3D inputs (MVLA), generate many more tokens, sharply increasing the computational demands of embodied intelligence systems.
- It argues that existing token pruning methods are ill-suited to MVLA because they ignore the differing salience of the 2D and 3D modalities.
- The authors introduce a tri-stage analysis that models the discrepancy and dynamics of 2D/3D modality salience, then build on it a tri-stage token pruning framework tailored to MVLA.
- Experiments report up to a 2.55x inference speedup with minimal accuracy loss, at an added overhead of 5.8%.
- The authors state that the code will be released soon, suggesting the method may be practically deployable after publication.
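The core idea of salience-aware pruning can be illustrated with a minimal sketch. This is not the paper's method (which uses a tri-stage analysis); it only shows the general pattern of keeping the top-scoring tokens per modality under separate budgets. All function names, the salience scores, and the per-modality keep ratios here are hypothetical.

```python
import numpy as np

def prune_tokens(tokens, salience, keep_ratio):
    """Keep the top keep_ratio fraction of tokens by salience score,
    preserving their original order."""
    k = max(1, int(len(tokens) * keep_ratio))
    keep = np.argsort(salience)[-k:]          # indices of the k highest scores
    return [tokens[i] for i in sorted(keep)]  # restore original token order

def modality_aware_prune(tokens_2d, sal_2d, tokens_3d, sal_3d,
                         ratio_2d, ratio_3d):
    """Hypothetical modality-aware pruning: apply separate keep budgets
    to the 2D and 3D token streams instead of one global threshold."""
    kept_2d = prune_tokens(tokens_2d, sal_2d, ratio_2d)
    kept_3d = prune_tokens(tokens_3d, sal_3d, ratio_3d)
    return kept_2d + kept_3d

# Example: 2D tokens pruned aggressively, 3D tokens kept more generously,
# reflecting the idea that the two modalities differ in salience.
kept = modality_aware_prune(
    tokens_2d=list(range(10)), sal_2d=np.arange(10),
    tokens_3d=list(range(10, 16)), sal_3d=np.arange(6),
    ratio_2d=0.5, ratio_3d=0.5,
)
```

In a real MVLA pipeline the salience scores would come from the model itself (e.g. attention statistics), and the budgets would be set dynamically rather than fixed per call.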