ETA-VLA: Efficient Token Adaptation via Temporal Fusion and Intra-LLM Sparsification for Vision-Language-Action Models

arXiv cs.AI / 3/30/2026


Key Points

  • The paper proposes ETA-VLA, an efficient token adaptation method for Vision-Language-Action (VLA) models used in autonomous driving, targeting the heavy cost of incorporating historical multi-view frames.
  • ETA-VLA introduces the Intra-LLM Sparse Aggregator (ILSA), which uses text-guided scoring and temporal consistency to dynamically prune redundant visual tokens while keeping a representative subset for scene understanding.
  • The approach is motivated by how human drivers allocate attention, aiming to preserve temporal reasoning accuracy without incurring the quadratic self-attention overhead typical of large models.
  • Experiments on NAVSIM v2 show ETA-VLA matches state-of-the-art driving performance while cutting computational FLOPs by about 32%, and reports pruning 85% of visual tokens with a 61% FLOP reduction while retaining roughly 94% of the original accuracy.
  • Overall, the work demonstrates a practical efficiency–accuracy tradeoff that could make VLA-based driving systems more computationally feasible for real-time inference.

Abstract

The integration of Vision-Language-Action (VLA) models into autonomous driving systems offers a unified framework for interpreting complex scenes and executing control commands. However, the need to incorporate historical multi-view frames for accurate temporal reasoning imposes a severe computational burden, driven primarily by the quadratic complexity of self-attention mechanisms in Large Language Models (LLMs). To alleviate this bottleneck, we propose ETA-VLA, an Efficient Token Adaptation framework for VLA models. ETA-VLA processes the past n frames of multi-view images and introduces a novel Intra-LLM Sparse Aggregator (ILSA). Drawing inspiration from human driver attention allocation, ILSA dynamically identifies and prunes redundant visual tokens guided by textual queries and temporal consistency. Specifically, we utilize a text-guided scoring mechanism alongside a diversity-preserving sparsification strategy to select a sparse subset of critical tokens, ensuring comprehensive awareness of the driving scene. Extensive experiments on the NAVSIM v2 benchmark demonstrate that ETA-VLA achieves driving performance comparable to state-of-the-art baselines while reducing computational FLOPs by approximately 32%. Notably, our method prunes 85% of visual tokens and reduces inference FLOPs by 61% while still retaining 94% of the original accuracy on the NAVSIM v2 benchmark.
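The abstract does not spell out how text-guided scoring and diversity-preserving sparsification are implemented. As a rough intuition only, the combination resembles maximal marginal relevance (MMR) selection: score each visual token by relevance to the text query, then greedily keep tokens that are relevant but not redundant with those already kept. The sketch below illustrates that generic pattern in NumPy; the function name `diverse_sparse_select`, the trade-off weight `lam`, and the cosine-similarity scoring are all illustrative assumptions, not the authors' actual ILSA design.

```python
import numpy as np

def text_guided_score(visual_tokens, text_query):
    # Cosine similarity between each visual token and a pooled text-query embedding.
    v = visual_tokens / np.linalg.norm(visual_tokens, axis=1, keepdims=True)
    t = text_query / np.linalg.norm(text_query)
    return v @ t

def diverse_sparse_select(visual_tokens, text_query, keep_ratio=0.15, lam=0.5):
    """Greedy MMR-style selection (illustrative, not the paper's ILSA):
    trade off text relevance against redundancy with already-kept tokens."""
    n = visual_tokens.shape[0]
    k = max(1, int(round(keep_ratio * n)))       # e.g. keep 15% => prune 85%
    rel = text_guided_score(visual_tokens, text_query)
    v = visual_tokens / np.linalg.norm(visual_tokens, axis=1, keepdims=True)
    sim = v @ v.T                                # pairwise token similarity
    selected = [int(np.argmax(rel))]             # seed with the most relevant token
    while len(selected) < k:
        redundancy = sim[:, selected].max(axis=1)
        mmr = lam * rel - (1 - lam) * redundancy
        mmr[selected] = -np.inf                  # never re-pick a kept token
        selected.append(int(np.argmax(mmr)))
    return sorted(selected)                      # indices of tokens to keep
```

With `keep_ratio=0.15`, this keeps 15% of tokens, matching the 85% pruning rate reported in the abstract; the actual ILSA additionally uses temporal consistency across frames, which this single-frame sketch omits.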