Efficient Multi-View 3D Object Detection by Dynamic Token Selection and Fine-Tuning

arXiv cs.CV / 4/16/2026


Key Points

  • The paper addresses the high computational cost of multi-view 3D object detection that typically relies on large-scale pre-trained ViT-based foundation models as backbones.
  • It proposes a new approach combining an image token compensator with dynamic layer-wise token selection to improve over ToC3D, which uses fixed per-layer token selection ratios.
  • The work introduces parameter-efficient fine-tuning that trains only the proposed modules, cutting fine-tuned parameters from over 300M to about 1.6M versus full end-to-end ViT retraining.
  • Experiments on the NuScenes dataset across three multi-view 3D detection approaches show large efficiency gains (48–55% fewer GFLOPs and 9–25% lower inference latency on an NVIDIA GV100) alongside accuracy improvements (1.0–2.8% mAP and 0.4–1.2% NuScenes detection score) compared to ToC3D.
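The dynamic layer-wise token selection described in the key points can be sketched as follows. This is a minimal illustration only: the score-thresholding rule, the hyperparameters `tau` and `min_keep`, and the fallback budget are assumptions for demonstration, not the paper's actual mechanism. The point of contrast with a fixed per-layer ratio is that the number of surviving tokens varies per layer and per input.

```python
import numpy as np

def dynamic_token_selection(tokens, scores, tau=0.5, min_keep=16):
    """Keep tokens whose normalized relevance score exceeds tau.

    Unlike a fixed per-layer keep ratio, the kept-token count here
    adapts to the score distribution of each layer's input.
    """
    # Normalize scores to [0, 1] so the threshold is scale-free.
    s = (scores - scores.min()) / (np.ptp(scores) + 1e-8)
    keep = np.flatnonzero(s >= tau)
    if keep.size < min_keep:
        # Fall back to a minimum token budget (illustrative safeguard).
        keep = np.argsort(s)[-min_keep:]
    return tokens[keep], keep

# Toy example: 64 image tokens with 32-dim features.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((64, 32))
scores = rng.random(64)
kept, idx = dynamic_token_selection(tokens, scores)
```

In a real ViT backbone this selection would run between transformer blocks, with `scores` coming from a learned relevance predictor (in ToC3D, ego-motion-based) rather than random values.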

Abstract

Existing multi-view three-dimensional (3D) object detection approaches widely adopt large-scale pre-trained vision transformer (ViT)-based foundation models as backbones, which makes them computationally expensive. To address this problem, the current state-of-the-art (SOTA) ToC3D for efficient multi-view ViT-based 3D object detection employs ego-motion-based relevant token selection. However, it has two key limitations: (1) fixed per-layer token selection ratios limit computational efficiency during both training and inference, and (2) full end-to-end retraining of the ViT backbone is required for each multi-view 3D object detection method. In this work, we propose an image token compensator combined with token selection for ViT backbones to accelerate multi-view 3D object detection. Unlike ToC3D, our approach enables dynamic layer-wise token selection within the ViT backbone. Furthermore, we introduce a parameter-efficient fine-tuning strategy that trains only the proposed modules, thereby reducing the number of fine-tuned parameters from more than 300 million (M) to only 1.6 M. Experiments on the large-scale NuScenes dataset across three multi-view 3D object detection approaches demonstrate that our proposed method decreases computational complexity (GFLOPs) by 48% to 55% and inference latency (on an NVIDIA GV100 GPU) by 9% to 25%, while still improving mean average precision by 1.0% to 2.8% absolute and the NuScenes detection score by 0.4% to 1.2% absolute compared to the previous SOTA ToC3D.
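The parameter-efficient fine-tuning idea (freeze the large backbone, train only the small added modules) can be sketched in PyTorch. The module sizes and names here are illustrative stand-ins, not the paper's architecture; the sketch only shows the freezing mechanism that shrinks the trainable-parameter count.

```python
import torch.nn as nn

# Hypothetical stand-ins: a ViT-like backbone and a small trainable
# adapter module (layer counts and widths are illustrative only).
backbone = nn.Sequential(*[nn.Linear(256, 256) for _ in range(12)])
adapter = nn.Sequential(nn.Linear(256, 32), nn.GELU(), nn.Linear(32, 256))

# Parameter-efficient fine-tuning: freeze every backbone parameter so
# the optimizer updates only the adapter.
for p in backbone.parameters():
    p.requires_grad = False

trainable = sum(p.numel() for p in adapter.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in backbone.parameters())
print(f"trainable: {trainable}, frozen: {frozen}")
```

In practice the optimizer would then be built from `adapter.parameters()` alone, which is how a 300M-parameter backbone can be fine-tuned by updating only a ~1.6M-parameter set of modules.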