ST-Prune: Training-Free Spatio-Temporal Token Pruning for Vision-Language Models in Autonomous Driving

arXiv cs.CV / 4/22/2026


Key Points

  • ST-Prune is a training-free, plug-and-play token pruning framework for vision-language models used in autonomous driving, targeting the heavy compute cost of multi-view, multi-frame inputs.
  • It combines Motion-aware Temporal Pruning (MTP) and Ring-view Spatial Pruning (RSP) to remove spatio-temporal redundancy that existing pruning methods miss by treating frames/views independently.
  • MTP prioritizes motion volatility and temporal recency in the diversity selection objective so that dynamic trajectories and current-frame content are retained over static history.
  • RSP uses ring-view camera geometry to penalize bilateral cross-view similarity, reducing duplicate projections and residual background that temporal pruning cannot eliminate.
  • Evaluated on four autonomous-driving-related benchmarks, ST-Prune achieves new state-of-the-art results for training-free token pruning, including near-lossless performance at 90% token reduction with inference speed comparable to prior pruning methods.
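The summary describes MTP only at a high level: a diversity selection objective with soft bonuses for motion volatility and temporal recency. A minimal sketch of that idea, assuming a greedy farthest-point-style selection over token embeddings (the function name, weighting scheme, and normalization are illustrative assumptions, not the paper's exact formulation), might look like:

```python
import numpy as np

def motion_aware_select(tokens, frame_ids, motion, keep, alpha=0.5, beta=0.5):
    """Illustrative greedy diversity selection with soft motion/recency bonuses.

    tokens:    (N, D) token embeddings pooled across all frames
    frame_ids: (N,) frame index per token (higher = more recent)
    motion:    (N,) per-token motion magnitude (e.g., feature diff vs. prev frame)
    keep:      number of tokens to retain
    """
    # Normalize the soft-constraint terms to [0, 1] so they are comparable.
    recency = frame_ids / max(frame_ids.max(), 1)
    volatility = motion / max(motion.max(), 1e-8)
    bonus = alpha * volatility + beta * recency

    # Unit-normalize embeddings so dot products are cosine similarities.
    feats = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    selected = [int(np.argmax(bonus))]          # seed with highest-priority token
    min_dist = 1 - feats @ feats[selected[0]]   # cosine distance to selected set
    for _ in range(keep - 1):
        # Diversity term (distance to the selected set) plus the soft bonus,
        # so dynamic, recent tokens win ties over static historical ones.
        score = min_dist + bonus
        score[selected] = -np.inf
        nxt = int(np.argmax(score))
        selected.append(nxt)
        min_dist = np.minimum(min_dist, 1 - feats @ feats[nxt])
    return sorted(selected)
```

Because the bonuses enter as additive soft terms rather than hard filters, a highly distinctive static token can still survive, which matches the stated goal of "prioritizing" rather than exclusively keeping dynamic content.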

Abstract

Vision-Language Models (VLMs) have become central to autonomous driving systems, yet their deployment is severely bottlenecked by the massive computational overhead of multi-view camera and multi-frame video inputs. Existing token pruning methods, primarily designed for single-image inputs, treat each frame or view in isolation and thus fail to exploit the inherent spatio-temporal redundancies in driving scenarios. To bridge this gap, we propose ST-Prune, a training-free, plug-and-play framework comprising two complementary modules: Motion-aware Temporal Pruning (MTP) and Ring-view Spatial Pruning (RSP). MTP addresses temporal redundancy by encoding motion volatility and temporal recency as soft constraints within the diversity selection objective, prioritizing dynamic trajectories and current-frame content over static historical background. RSP further resolves spatial redundancy by exploiting the ring-view camera geometry to penalize bilateral cross-view similarity, eliminating duplicate projections and residual background that temporal pruning alone cannot suppress. Together, these two modules constitute a complete spatio-temporal pruning process, preserving key scene information under strict compression. Validated across four benchmarks spanning perception, prediction, and planning, ST-Prune establishes a new state of the art for training-free token pruning. Notably, even at 90% token reduction, ST-Prune achieves near-lossless performance, with certain metrics surpassing the full-model baseline, while maintaining inference speeds comparable to existing pruning approaches.
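The abstract characterizes RSP as penalizing bilateral cross-view similarity under ring-view camera geometry. A minimal sketch of one way such a penalty could work, assuming per-view token scoring against the two adjacent cameras in the ring (the function name, the token-norm salience proxy, and the scoring form are assumptions for illustration, not the paper's method):

```python
import numpy as np

def ring_view_prune(view_tokens, keep_per_view, gamma=1.0):
    """Illustrative per-view token scoring with a bilateral cross-view penalty.

    view_tokens:   list of (N_v, D) token arrays, ordered around the camera ring
    keep_per_view: number of tokens to retain in each view
    gamma:         weight of the cross-view redundancy penalty
    """
    V = len(view_tokens)
    normed = [t / np.linalg.norm(t, axis=1, keepdims=True) for t in view_tokens]
    kept = []
    for v in range(V):
        # Ring geometry: each view overlaps its left and right neighbors.
        left, right = normed[(v - 1) % V], normed[(v + 1) % V]
        # Bilateral redundancy: max cosine similarity to either neighbor,
        # high for duplicate projections in the overlap regions.
        sim = np.maximum((normed[v] @ left.T).max(axis=1),
                         (normed[v] @ right.T).max(axis=1))
        # Salience proxy (assumption): raw token norm, penalized by redundancy.
        salience = np.linalg.norm(view_tokens[v], axis=1)
        score = salience - gamma * sim
        idx = np.argsort(score)[::-1][:keep_per_view]
        kept.append(np.sort(idx))
    return kept
```

The penalty is spatial rather than temporal: a static guardrail duplicated across two adjacent cameras scores low here even though temporal pruning would not flag it, which is the residual redundancy the abstract says RSP targets.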