CAM3DNet: Comprehensively mining the multi-scale features for 3D Object Detection with Multi-View Cameras
arXiv cs.CV / 4/21/2026
Key Points
- CAM3DNet is a new sparse, query-based 3D object detection framework for multi-view camera inputs that targets inefficiencies in learning dynamic multi-scale feature information.
- The method introduces three core modules: Composite Query (CQ) to project 2D queries into 3D space, Adaptive Self-Attention (ASA) to model interactions among spatiotemporal multi-scale queries, and Multi-Scale Hybrid Sampling (MSHS) to efficiently sample multi-scale object features using deformable attention with camera priors.
- The overall architecture pairs a backbone with an FPN encoder, uses a YOLOX- and DepthNet-based ROI head to produce the Composite Queries, and then repeatedly applies ASA and MSHS in the decoder to refine detection features.
- Experiments on nuScenes, Waymo, and Argoverse show that CAM3DNet outperforms existing camera-based 3D object detection approaches, and ablation studies quantify both the accuracy contributions and the compute/memory costs of CQ, ASA, and MSHS.
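To make the pipeline concrete, here is a minimal sketch of the decoder loop described above. The function names mirror the paper's CQ/ASA/MSHS modules, but the internals are illustrative placeholders (plain softmax attention stands in for the adaptive variant, and a nearest-neighbour lookup stands in for deformable attention with camera priors); this is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def composite_query(roi_feats_2d, depth, num_queries=4, dim=8):
    """Lift 2D ROI features into 3D query embeddings using predicted depth
    (placeholder for the Composite Query module)."""
    q2d = roi_feats_2d[:num_queries]                  # (Q, dim) 2D appearance
    d = np.tile(depth[:num_queries, None], (1, dim))  # (Q, dim) depth embedding
    return q2d + d                                    # (Q, dim) 3D queries

def adaptive_self_attention(queries):
    """Model interactions among queries; plain scaled-dot-product
    self-attention stands in for the adaptive variant (ASA)."""
    scores = queries @ queries.T / np.sqrt(queries.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ queries

def multi_scale_hybrid_sampling(queries, feature_levels):
    """Gather features from each FPN level at query-dependent locations;
    an index lookup stands in for deformable-attention sampling (MSHS)."""
    sampled = np.zeros_like(queries)
    for feats in feature_levels:  # one feature map per scale
        idx = np.abs(queries.sum(axis=1)).astype(int) % feats.shape[0]
        sampled += feats[idx]
    return queries + sampled / len(feature_levels)

# Toy inputs: 2D ROI features, per-ROI depths, and two FPN feature levels.
roi_feats = rng.standard_normal((4, 8))
depths = rng.uniform(1.0, 50.0, size=4)
levels = [rng.standard_normal((16, 8)), rng.standard_normal((32, 8))]

queries = composite_query(roi_feats, depths)
for _ in range(3):  # stacked decoder layers refine the queries
    queries = adaptive_self_attention(queries)
    queries = multi_scale_hybrid_sampling(queries, levels)

print(queries.shape)  # (4, 8): refined 3D object queries
```

The point of the sketch is the control flow: queries are constructed once from 2D detections plus depth, then alternately mixed with each other (ASA) and enriched with multi-scale image features (MSHS) across stacked decoder layers.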