Control Your Queries: Heterogeneous Query Interaction for Camera-Radar Fusion

arXiv cs.CV · April 29, 2026


Key Points

  • The paper introduces a new camera–radar fusion paradigm called heterogeneous query interaction for autonomous driving, aiming to improve both sensing complementarity and deployment practicality.
  • It presents ConFusion, a 3D object detector that combines multiple query types—image queries, radar queries, and learnable world queries distributed in 3D space—to improve query initialization and object coverage.
  • To strengthen interaction across query types, the authors propose heterogeneous query mixing (QMix), which applies dedicated cross-type attention after feature sampling to consolidate complementary evidence.
  • They further introduce interactive query swap sampling (QSwap), enabling related queries to exchange informative feature tokens while respecting attention and geometric constraints to improve sampling quality.
  • On nuScenes, ConFusion reports state-of-the-art results with 59.1 mAP / 65.6 NDS on the validation set and 61.6 mAP / 67.9 NDS on the test set.

Abstract

In autonomous driving, camera-radar fusion offers complementary sensing and low deployment cost. Existing methods perform fusion through input mixing, feature map mixing, or query-based feature sampling. We propose a new fusion paradigm, termed heterogeneous query interaction, and present ConFusion, a camera-radar 3D object detector. ConFusion combines image queries, radar queries, and learnable world queries distributed in 3D space to improve query initialization and object coverage. To encourage cross-type interaction among heterogeneous queries, we introduce heterogeneous query mixing (QMix), which performs dedicated cross-type attention after feature sampling to consolidate complementary object evidence. We further propose interactive query swap sampling (QSwap), which improves feature sampling by allowing related queries to exchange informative feature tokens under attention and geometric constraints. Experiments on the nuScenes dataset show that ConFusion achieves state-of-the-art performance, reaching 59.1 mAP and 65.6 NDS on the validation set, and 61.6 mAP and 67.9 NDS on the test set.
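The abstract describes three heterogeneous query types whose features are consolidated through dedicated cross-type attention (QMix) after sampling. The paper's actual architecture is not given here, so the following is only a minimal NumPy sketch of the general idea: each query set attends over the pooled set of all query types with single-head scaled dot-product attention, so that image, radar, and world queries can absorb complementary evidence from one another. All dimensions, query counts, and the residual-update form are illustrative assumptions, not the authors' design.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_type_attention(q, kv, dim):
    # q attends over kv (single-head scaled dot-product attention);
    # stands in for the paper's dedicated cross-type attention
    scores = q @ kv.T / np.sqrt(dim)
    return softmax(scores, axis=-1) @ kv

d = 32  # feature dimension (illustrative)
img_q = rng.normal(size=(40, d))    # image-initialized queries
rad_q = rng.normal(size=(30, d))    # radar-initialized queries
world_q = rng.normal(size=(20, d))  # learnable world queries in 3D space

# Each query type consolidates evidence from all types (residual update),
# then the mixed queries are pooled for downstream decoding
all_q = np.concatenate([img_q, rad_q, world_q], axis=0)
mixed = np.concatenate(
    [q + cross_type_attention(q, all_q, d) for q in (img_q, rad_q, world_q)],
    axis=0,
)
print(mixed.shape)  # (90, 32)
```

In a real detector these would be learned projections (Q/K/V) inside a transformer decoder, and QSwap would additionally exchange sampled feature tokens between related queries under attention and geometric constraints; this sketch only conveys how heterogeneous query sets can interact through shared attention.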