QYOLO: Lightweight Object Detection via Quantum-Inspired Shared Channel Mixing

arXiv cs.AI / 4/30/2026


Key Points

  • The paper proposes QYOLO, a lightweight object detection approach that compresses a YOLO-style backbone by replacing the two deepest C2f modules with a quantum-inspired channel mixing block (QMixBlock).
  • QMixBlock uses sinusoidal global channel recalibration with shared learnable parameters across both backbone stages (P4/16 and P5/32), reducing parameters without needing stage-specific parameter sets.
  • The neck and detection head remain fully classical, so the method targets computational savings primarily in the backbone where channel width scaling drives most of the overhead.
  • Experiments on VisDrone2019 show that QYOLOv8n reduces parameters by 20.2% (3.01M→2.40M) and GFLOPs by 12.3% with only a 0.4 pp drop in mAP@50, and QYOLOv8s reduces parameters by 21.8% with just 0.1 pp degradation.
  • Adding knowledge distillation recovers full accuracy parity without sacrificing compression, while a larger backbone+neck variant achieves even higher compression (38–41%) at the cost of greater accuracy loss, motivating the final backbone-only design.
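The shared sinusoidal recalibration described above can be sketched in plain Python. The exact gating formula is an assumption for illustration (the paper only states that QMixBlock applies sinusoidal global channel mixing with learnable parameters shared across the P4/16 and P5/32 stages); the names `qmix_gates` and `qmix_recalibrate` are hypothetical.

```python
import math

def qmix_gates(num_channels, alpha, phi):
    """Sinusoidal per-channel gates in [0, 1] from two shared scalars.

    alpha and phi stand in for the shared learnable parameters; the
    specific sin-based form below is an illustrative assumption.
    """
    return [
        0.5 * (1.0 + math.sin(alpha * (c / num_channels) + phi))
        for c in range(num_channels)
    ]

def qmix_recalibrate(features, alpha, phi):
    """Scale each channel's activations by its gate.

    features: list of per-channel activation lists (a stand-in for a
    channels-first feature map).
    """
    gates = qmix_gates(len(features), alpha, phi)
    return [[g * x for x in channel] for g, channel in zip(gates, features)]

# The same (alpha, phi) pair is reused at both backbone stages, which is
# what removes the need for stage-specific parameter sets:
alpha, phi = 2.0 * math.pi, 0.0
p4_out = qmix_recalibrate([[1.0, 2.0]] * 512, alpha, phi)   # 512-ch P4/16
p5_out = qmix_recalibrate([[1.0]] * 1024, alpha, phi)       # 1024-ch P5/32
```

Because the gate vector is generated from a couple of shared scalars rather than stored per channel, the parameter cost of this block is essentially constant in channel width, unlike the C2f modules it replaces.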

Abstract

The rapid advancement of object detection architectures has positioned single-stage detectors as the dominant solution for real-time visual perception. A primary source of computational overhead in these models lies in the deep backbone stages, where C2f bottleneck modules at high stride levels accumulate a disproportionate share of parameters due to quadratic scaling with channel width. This work introduces QYOLO, a quantum-inspired channel mixing framework that achieves genuine architectural compression by replacing the two deepest backbone C2f modules at P4/16 (512 channels) and P5/32 (1024 channels) with a compact QMixBlock. The proposed block performs global channel recalibration through a sinusoidal mixing mechanism with shared learnable parameters across both backbone stages, enforcing consistent channel importance without requiring independent per-stage parameter sets. The neck and detection head remain fully classical and unchanged. Evaluation on the VisDrone2019 benchmark demonstrates that QYOLOv8n achieves a 20.2% reduction in parameter count (3.01M to 2.40M) and a 12.3% GFLOPs reduction with only 0.4 pp mAP@50 degradation. QYOLOv8s achieves a 21.8% reduction with 0.1 pp degradation. When combined with knowledge distillation, full accuracy parity is recovered at no cost to compression. An expanded backbone-plus-neck variant achieved a 38–41% reduction at the cost of greater accuracy degradation, motivating the backbone-only final design.
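The "quadratic scaling with channel width" claim in the abstract follows directly from the standard convolution parameter count (weights ≈ C_in × C_out × k²). A quick check with the channel widths the abstract cites shows why the deepest stages dominate:

```python
def conv_params(c_in, c_out, k=3):
    """Parameter count of a k×k convolution, bias omitted."""
    return c_in * c_out * k * k

# Channel widths from the abstract: 512 at P4/16, 1024 at P5/32.
p4 = conv_params(512, 512)      # 2,359,296 weights
p5 = conv_params(1024, 1024)    # 9,437,184 weights

# Doubling the channel width quadruples the cost, so the two deepest
# stages carry a disproportionate share of the backbone's parameters:
assert p5 == 4 * p4
```

This is why replacing only these two modules with a near-parameter-free recalibration block already yields the roughly 20% whole-model reductions reported for QYOLOv8n and QYOLOv8s.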