Efficient Multi-View 3D Object Detection by Dynamic Token Selection and Fine-Tuning

arXiv cs.CV / 4/16/2026


Key Points

  • The paper addresses the high computational cost of multi-view 3D object detection that typically relies on large-scale pre-trained ViT-based foundation models as backbones.
  • It proposes a new approach combining an image token compensator with dynamic layer-wise token selection to improve over ToC3D, which uses fixed per-layer token selection ratios.
  • The work introduces parameter-efficient fine-tuning that trains only the proposed modules, cutting fine-tuned parameters from over 300M to about 1.6M versus full end-to-end ViT retraining.
  • Experiments on the NuScenes dataset across three multi-view 3D detection approaches show large efficiency gains (48–55% fewer GFLOPs and 9–25% lower inference latency on an NVIDIA GV100) alongside accuracy improvements (1.0–2.8% mAP and 0.4–1.2% NuScenes detection score) compared to ToC3D.
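The dynamic layer-wise token selection described in the key points can be sketched as follows. This is a minimal illustration only: the score-thresholding rule, the hyperparameters `tau` and `min_keep`, and the fallback budget are assumptions for demonstration, not the paper's actual mechanism. The point of contrast with a fixed per-layer ratio is that the number of surviving tokens varies per layer and per input.

```python
import numpy as np

def dynamic_token_selection(tokens, scores, tau=0.5, min_keep=16):
    """Keep tokens whose normalized relevance score exceeds tau.

    Unlike a fixed per-layer keep ratio, the kept-token count here
    adapts to the score distribution of each layer's input.
    """
    # Normalize scores to [0, 1] so the threshold is scale-free.
    s = (scores - scores.min()) / (np.ptp(scores) + 1e-8)
    keep = np.flatnonzero(s >= tau)
    if keep.size < min_keep:
        # Fall back to a minimum token budget (illustrative safeguard).
        keep = np.argsort(s)[-min_keep:]
    return tokens[keep], keep

# Toy example: 64 image tokens with 32-dim features.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((64, 32))
scores = rng.random(64)
kept, idx = dynamic_token_selection(tokens, scores)
```

In a real ViT backbone this selection would run between transformer blocks, with `scores` coming from a learned relevance predictor (in ToC3D, ego-motion-based) rather than random values.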

Abstract

Existing multi-view three-dimensional (3D) object detection approaches widely adopt large-scale pre-trained vision transformer (ViT)-based foundation models as backbones, which makes them computationally expensive. To address this problem, the current state-of-the-art (SOTA) ToC3D for efficient multi-view ViT-based 3D object detection employs ego-motion-based relevant token selection. However, it has two key limitations: (1) fixed per-layer token selection ratios limit computational efficiency during both training and inference, and (2) full end-to-end retraining of the ViT backbone is required for each multi-view 3D object detection method. In this work, we propose an image token compensator combined with token selection for ViT backbones to accelerate multi-view 3D object detection. Unlike ToC3D, our approach enables dynamic layer-wise token selection within the ViT backbone. Furthermore, we introduce a parameter-efficient fine-tuning strategy that trains only the proposed modules, thereby reducing the number of fine-tuned parameters from more than 300 million (M) to only 1.6 M. Experiments on the large-scale NuScenes dataset across three multi-view 3D object detection approaches demonstrate that our proposed method decreases computational complexity (GFLOPs) by 48% to 55% and inference latency (on an NVIDIA GV100 GPU) by 9% to 25%, while still improving mean average precision by 1.0% to 2.8% absolute and the NuScenes detection score by 0.4% to 1.2% absolute compared to the previous SOTA ToC3D.
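The parameter-efficient fine-tuning idea (freeze the large backbone, train only the small added modules) can be sketched in PyTorch. The module sizes and names here are illustrative stand-ins, not the paper's architecture; the sketch only shows the freezing mechanism that shrinks the trainable-parameter count.

```python
import torch.nn as nn

# Hypothetical stand-ins: a ViT-like backbone and a small trainable
# adapter module (layer counts and widths are illustrative only).
backbone = nn.Sequential(*[nn.Linear(256, 256) for _ in range(12)])
adapter = nn.Sequential(nn.Linear(256, 32), nn.GELU(), nn.Linear(32, 256))

# Parameter-efficient fine-tuning: freeze every backbone parameter so
# the optimizer updates only the adapter.
for p in backbone.parameters():
    p.requires_grad = False

trainable = sum(p.numel() for p in adapter.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in backbone.parameters())
print(f"trainable: {trainable}, frozen: {frozen}")
```

In practice the optimizer would then be built from `adapter.parameters()` alone, which is how a 300M-parameter backbone can be fine-tuned by updating only a ~1.6M-parameter set of modules.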