Revisiting Token Compression for Accelerating ViT-based Sparse Multi-View 3D Object Detectors

arXiv cs.CV / 4/17/2026


Key Points

  • ViT-based sparse multi-view 3D object detectors are accurate but still slow at inference because token processing is computationally heavy.
  • Prior token compression approaches (token pruning/merging and enlarging patch sizes) can remove informative background cues, break contextual consistency, and degrade fine-grained semantics, harming 3D detection.
  • The paper proposes SEPatch3D, which dynamically adjusts patch sizes to retain critical semantic information in coarse patches while cutting computation.
  • SEPatch3D includes SPSS (choosing small patches for nearby-object scenes and large patches for background-dominated scenes), IPS (selecting informative patches for refinement), and CGFE (injecting fine-grained details into coarse patches).
  • Experiments on nuScenes and Argoverse 2 show up to 57% faster inference than the StreamPETR baseline and 20% higher efficiency than ToC3D-faster, with comparable detection accuracy, and the authors provide code on GitHub.

Abstract

Vision Transformer (ViT)-based sparse multi-view 3D object detectors have achieved remarkable accuracy but still suffer from high inference latency due to heavy token processing. To accelerate these models, token compression has been widely explored. However, our revisit of existing strategies, such as token pruning, merging, and patch size enlargement, reveals that they often discard informative background cues, disrupt contextual consistency, and lose fine-grained semantics, negatively affecting 3D detection. To overcome these limitations, we propose SEPatch3D, a novel framework that dynamically adjusts patch sizes while preserving critical semantic information within coarse patches. Specifically, we design Spatiotemporal-aware Patch Size Selection (SPSS) that assigns small patches to scenes containing nearby objects to preserve fine details and large patches to background-dominated scenes to reduce computation cost. To further mitigate potential detail loss, Informative Patch Selection (IPS) selects the informative patches for feature refinement, and Cross-Granularity Feature Enhancement (CGFE) injects fine-grained details into selected coarse patches, enriching semantic features. Experiments on the nuScenes and Argoverse 2 validation sets show that SEPatch3D achieves up to **57%** faster inference than the StreamPETR baseline and **20%** higher efficiency than the state-of-the-art ToC3D-faster, while preserving comparable detection accuracy. Code is available at https://github.com/Mingqj/SEPatch3D.
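The core idea behind SPSS can be illustrated with a minimal sketch. Note this is an illustrative toy, not the paper's implementation: the function names (`select_patch_size`, `patchify`), the `nearby_object_ratio` score, and the 0.3 threshold are all assumptions chosen for the example. It shows how picking a larger patch size for background-dominated scenes quarters the token count a ViT must process.

```python
import numpy as np

def select_patch_size(nearby_object_ratio: float,
                      small: int = 8, large: int = 16,
                      threshold: float = 0.3) -> int:
    """Hypothetical SPSS-style rule: scenes with many nearby objects keep
    small patches (fine detail); background-dominated scenes get large
    patches (fewer tokens, less compute)."""
    return small if nearby_object_ratio >= threshold else large

def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an HxWxC image into non-overlapping (patch x patch) tokens."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    tokens = image.reshape(h // patch, patch, w // patch, patch, c)
    return tokens.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)

image = np.zeros((64, 64, 3), dtype=np.float32)

# Object-heavy scene -> small patches -> more, finer tokens.
p_obj = select_patch_size(nearby_object_ratio=0.6)   # 8
# Background-dominated scene -> large patches -> 4x fewer tokens.
p_bg = select_patch_size(nearby_object_ratio=0.1)    # 16

print(patchify(image, p_obj).shape)  # (64, 192)
print(patchify(image, p_bg).shape)   # (16, 768)
```

Since self-attention cost grows quadratically with token count, the 4x token reduction in the background case is where the reported inference speedup would come from; IPS and CGFE then compensate for the detail lost inside the coarse patches.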
