Leveraging Previous-Traversal Point Cloud Map Priors for Camera-Based 3D Object Detection and Tracking

arXiv cs.CV / April 29, 2026


Key Points

  • The paper addresses camera-only 3D object detection and tracking in autonomous driving, where depth ambiguity limits precise 3D localization without expensive online LiDAR at inference.
  • It proposes DualViewMapDet, which uses point-cloud maps built from previous traversals of the same environment as an online source of static geometric priors during deployment.
  • The method applies a dual-space camera–map fusion strategy: it projects the map into perspective view (PV) to enrich image features, additionally encodes the map directly in bird's-eye view (BEV), and fuses both streams in a shared metric space (see the sketches after this list and the abstract).
  • Experiments on nuScenes and Argoverse 2 show consistent improvements over strong camera-only baselines, with especially large gains in object localization, and ablations confirm the value of PV/BEV fusion and map coverage.
  • The authors release code and pre-trained models publicly to support replication and further research.
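
To make the PV side of the dual-space design concrete, the sketch below shows how a static point-cloud map could be rasterized into perspective-view geometric cue channels. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name, the depth/height channel choice, and the coordinate conventions are all hypothetical.

```python
# Hypothetical sketch: render a prior point-cloud map into perspective-view
# (PV) cue channels. Channel set (depth, height) and all names are assumptions.
import numpy as np

def render_map_to_pv(points_ego: np.ndarray,   # (N, 3) map points in the ego frame
                     cam_from_ego: np.ndarray, # (4, 4) extrinsic: ego -> camera
                     intrinsics: np.ndarray,   # (3, 3) camera intrinsic matrix K
                     hw: tuple) -> np.ndarray:
    """Rasterize map points into an (H, W, 2) image of depth and height cues,
    keeping the nearest point per pixel."""
    h, w = hw
    # Transform points into the camera frame.
    pts_h = np.concatenate([points_ego, np.ones((len(points_ego), 1))], axis=1)
    pts_cam = (cam_from_ego @ pts_h.T).T[:, :3]
    # Keep only points in front of the camera.
    front = pts_cam[:, 2] > 0.1
    pts_cam, heights = pts_cam[front], points_ego[front, 2]
    # Pinhole projection to pixel coordinates.
    uv = (intrinsics @ pts_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]
    u, v, depth = uv[:, 0].astype(int), uv[:, 1].astype(int), pts_cam[:, 2]
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    u, v, depth, heights = u[inside], v[inside], depth[inside], heights[inside]
    # Z-buffer: write far-to-near so the nearest point wins each pixel.
    order = np.argsort(-depth)
    cues = np.zeros((h, w, 2), dtype=np.float32)
    cues[v[order], u[order], 0] = depth[order]    # channel 0: metric depth
    cues[v[order], u[order], 1] = heights[order]  # channel 1: ego-frame point height
    return cues
```

Per the abstract, such cue channels would then enrich the image features and guide the camera-to-BEV lifting step.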

Abstract

Camera-based 3D object detection and tracking are central to autonomous driving, yet precise 3D object localization remains fundamentally constrained by depth ambiguity when no expensive, depth-rich online LiDAR is available at inference. In many deployments, however, vehicles repeatedly traverse the same environments, making static point cloud maps from prior traversals a practical source of geometric priors. We propose DualViewMapDet, a camera-only inference framework that retrieves such map priors online and leverages them to mitigate the absence of a LiDAR sensor during deployment. The key idea is a dual-space camera–map fusion strategy that avoids one-sided view conversion. Specifically, we (i) project the map into perspective view (PV) and encode multi-channel geometric cues to enrich image features and support bird's-eye-view (BEV) lifting, and (ii) encode the map directly in BEV with a sparse voxel backbone and fuse it with lifted camera features in a shared metric space. Extensive evaluations on nuScenes and Argoverse 2 demonstrate consistent improvements over strong camera-only baselines, with particularly strong gains in object localization. Ablations further validate the contributions of PV/BEV fusion and prior-map coverage. We make the code and pre-trained models available at https://dualviewmapdet.cs.uni-freiburg.de.
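
For the BEV side of the fusion, the following is a minimal sketch under stated assumptions: a small dense convolutional encoder stands in for the paper's sparse voxel backbone, and concatenation followed by convolution stands in for whatever fusion operator the authors actually use. All module and parameter names are hypothetical.

```python
# Hypothetical sketch: fuse a BEV-encoded map with lifted camera BEV features
# on a shared metric grid. Not the paper's architecture; names are assumptions.
import torch
import torch.nn as nn

class BEVMapFusion(nn.Module):
    def __init__(self, cam_ch: int = 128, map_ch: int = 32, out_ch: int = 128):
        super().__init__()
        # Dense stand-in for the sparse voxel map encoder described in the paper.
        self.map_encoder = nn.Sequential(
            nn.Conv2d(map_ch, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
        )
        # Fuse camera and map features that live on the same metric BEV grid.
        self.fuse = nn.Sequential(
            nn.Conv2d(cam_ch + 64, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )

    def forward(self, cam_bev: torch.Tensor, map_bev: torch.Tensor) -> torch.Tensor:
        # cam_bev: (B, cam_ch, H, W) camera features lifted to BEV
        # map_bev: (B, map_ch, H, W) rasterized map channels (e.g., occupancy, height)
        return self.fuse(torch.cat([cam_bev, self.map_encoder(map_bev)], dim=1))

# Example: a 100 m x 100 m grid at 0.5 m resolution gives 200 x 200 BEV cells.
fused = BEVMapFusion()(torch.randn(1, 128, 200, 200), torch.randn(1, 32, 200, 200))
```

Because both inputs are expressed on the same metric grid, the fusion needs no further view conversion, which is the property the dual-space design is built around.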