AI Navigate

Leveraging Large Vision Model for Multi-UAV Co-perception in Low-Altitude Wireless Networks

arXiv cs.CV / 3/19/2026

📰 News · Developer Stack & Infrastructure · Models & Research

Key Points

  • The paper introduces a Base-Station-Helped UAV (BHU) framework to enable communication-efficient multi-UAV cooperative perception in low-altitude wireless networks.
  • It employs a Top-K pixel-selection mechanism to sparsify UAV-captured RGB images, transmitting only the most informative pixels to a ground server to cut data volume and latency.
  • The sparsified images are sent via multi-user MIMO (MU-MIMO), where a Swin-large-based MaskDINO encoder extracts BEV features and performs cooperative feature fusion for ground vehicle perception.
  • A diffusion-model-based deep reinforcement learning (DRL) algorithm jointly selects cooperative UAVs, sparsification ratios, and precoding matrices to balance communication efficiency and perception utility.
  • Experiments on the Air-Co-Pred dataset show a perception-performance improvement of over 5% while reducing communication overhead by 85% compared with traditional CNN-based BEV fusion baselines.
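The Top-K pixel-selection step above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the informativeness score here is a stand-in (the paper's scoring criterion is not given in this summary), and the function name and signature are hypothetical.

```python
import numpy as np

def topk_sparsify(image: np.ndarray, scores: np.ndarray, keep_ratio: float):
    """Keep only the keep_ratio fraction of pixels with the highest scores.

    image:  (H, W, C) RGB image captured by a UAV
    scores: (H, W) per-pixel informativeness (stand-in for the paper's criterion)
    Returns the sparsified image plus the flat indices of kept pixels --
    in practice only the kept values and their indices would be transmitted.
    """
    h, w, c = image.shape
    k = max(1, int(keep_ratio * h * w))
    flat_scores = scores.reshape(-1)
    # Unordered top-k via argpartition is enough; full sorting is not needed.
    kept = np.argpartition(flat_scores, -k)[-k:]
    sparse = np.zeros_like(image)
    sparse.reshape(-1, c)[kept] = image.reshape(-1, c)[kept]
    return sparse, kept

# Toy usage: a 4x4 image scored by luminance, keeping 25% of pixels.
rng = np.random.default_rng(0)
img = rng.random((4, 4, 3))
luminance = img.mean(axis=-1)
sparse_img, kept_idx = topk_sparsify(img, luminance, keep_ratio=0.25)
print(len(kept_idx))  # 4 pixels kept out of 16
```

Transmitting only the kept values and indices is what drives the reported ~85% reduction in data volume, at the cost of discarding low-scoring pixels before BEV feature extraction.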

Abstract

Multi-uncrewed aerial vehicle (UAV) cooperative perception has emerged as a promising paradigm for diverse low-altitude economy applications, where complementary multi-view observations are leveraged to enhance perception performance via wireless communications. However, the massive visual data generated by multiple UAVs poses significant challenges in terms of communication latency and resource efficiency. To address these challenges, this paper proposes a communication-efficient cooperative perception framework, termed Base-Station-Helped UAV (BHU), which reduces communication overhead while enhancing perception performance. Specifically, we employ a Top-K selection mechanism to identify the most informative pixels from UAV-captured RGB images, enabling sparsified visual transmission with reduced data volume and latency. The sparsified images are transmitted to a ground server via multi-user MIMO (MU-MIMO), where a Swin-large-based MaskDINO encoder extracts bird's-eye-view (BEV) features and performs cooperative feature fusion for ground vehicle perception. Furthermore, we develop a diffusion model-based deep reinforcement learning (DRL) algorithm to jointly select cooperative UAVs, sparsification ratios, and precoding matrices, achieving a balance between communication efficiency and perception utility. Simulation results on the Air-Co-Pred dataset demonstrate that, compared with traditional CNN-based BEV fusion baselines, the proposed BHU framework improves perception performance by over 5% while reducing communication overhead by 85%, providing an effective solution for multi-UAV cooperative perception under resource-constrained wireless environments.
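The abstract's DRL agent selects the MU-MIMO precoding matrices; the paper's learned precoder is not reproduced here, but a classical zero-forcing precoder is a common baseline for suppressing inter-stream interference and gives a feel for what such a matrix does. A minimal sketch, with all dimensions and names chosen for illustration:

```python
import numpy as np

def zero_forcing_precoder(H: np.ndarray) -> np.ndarray:
    """Zero-forcing precoder for K streams over N antennas (K <= N).

    H: (K, N) complex channel matrix.
    Returns W of shape (N, K) such that H @ W is diagonal, i.e. each
    stream arrives free of interference from the other streams.
    """
    # Right pseudo-inverse: W = H^H (H H^H)^{-1}
    W = H.conj().T @ np.linalg.inv(H @ H.conj().T)
    # Normalize columns to unit transmit power per stream.
    W /= np.linalg.norm(W, axis=0, keepdims=True)
    return W

# Toy setup: 3 UAV streams received over an 8-antenna array.
rng = np.random.default_rng(1)
K, N = 3, 8
H = (rng.standard_normal((K, N)) + 1j * rng.standard_normal((K, N))) / np.sqrt(2)
W = zero_forcing_precoder(H)
effective = H @ W
off_diag = effective - np.diag(np.diag(effective))
print(np.max(np.abs(off_diag)))  # ~0: inter-stream interference nulled
```

Zero-forcing ignores noise amplification and per-UAV power constraints, which is precisely the kind of trade-off the paper's diffusion-model-based DRL algorithm optimizes jointly with UAV selection and sparsification ratios.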