Tri-Modal Fusion Transformers for UAV-based Object Detection

arXiv cs.CV · April 21, 2026


Key Points

  • The paper proposes a tri-modal UAV object detection framework that jointly leverages RGB, thermal (LWIR), and event-camera data to handle illumination changes, motion blur, and dynamic scenes that degrade RGB reliability.
  • It uses a dual-stream hierarchical vision transformer with two key fusion modules—Modality-Aware Gated Exchange (MAGE) and Bidirectional Token Exchange (BiTE)—to exchange information at selected encoder depths and produce resolution-preserving fused feature maps for a feature pyramid and two-stage detector.
  • The work introduces a new UAV dataset containing 10,489 synchronized and pre-aligned RGB–thermal–event frames and 24,223 annotated vehicles across day and night flights.
  • Through 61 ablation experiments, the authors find that tri-modal fusion outperforms dual-modal baselines and that fusion depth is critical, while a lightweight CSSA variant can recover most gains at minimal added cost.
  • The authors position the results as the first systematic benchmark and modular backbone for tri-modal UAV-based object detection, supporting further research and comparisons.
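To make the MAGE idea above concrete, here is a minimal NumPy sketch of inter-sensor channel and spatial gating between two modality feature maps. The gate formulation (global-average-pool channel gate, channel-mean spatial gate, residual connection) and the function name `gated_exchange` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_exchange(feat_a, feat_b):
    """Let modality B gate modality A's feature map (C, H, W).

    Illustrative stand-in for the paper's Modality-Aware Gated
    Exchange (MAGE); the exact gating layers are assumptions.
    """
    # Channel gate from B: global average pool over space -> (C, 1, 1)
    chan_gate = sigmoid(feat_b.mean(axis=(1, 2), keepdims=True))
    # Spatial gate from B: mean over channels -> (1, H, W)
    spat_gate = sigmoid(feat_b.mean(axis=0, keepdims=True))
    # Modulate A by B's gates, keeping a residual path
    return feat_a + chan_gate * spat_gate * feat_a

rgb = np.random.rand(8, 16, 16)      # toy RGB feature map (C, H, W)
thermal = np.random.rand(8, 16, 16)  # toy thermal feature map

fused_rgb = gated_exchange(rgb, thermal)      # thermal gates RGB
fused_thermal = gated_exchange(thermal, rgb)  # RGB gates thermal
print(fused_rgb.shape)  # (8, 16, 16) -- spatial resolution preserved
```

Note that the output keeps the input resolution, matching the paper's stated goal of resolution-preserving fused maps feeding a feature pyramid.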

Abstract

Reliable UAV object detection requires robustness to illumination changes, motion blur, and scene dynamics that suppress RGB cues. Thermal long-wave infrared (LWIR) sensing preserves contrast in low light, and event cameras retain microsecond-level temporal edges, but integrating all three modalities in a unified detector has not been systematically studied. We present a tri-modal framework that processes RGB, thermal, and event data with a dual-stream hierarchical vision transformer. At selected encoder depths, a Modality-Aware Gated Exchange (MAGE) applies inter-sensor channel and spatial gating, and a Bidirectional Token Exchange (BiTE) module performs bidirectional token-level attention with depthwise-pointwise refinement, producing resolution-preserving fused maps for a standard feature pyramid and two-stage detector. We introduce a 10,489-frame UAV dataset with synchronized and pre-aligned RGB-thermal-event streams and 24,223 annotated vehicles across day and night flights. Through 61 controlled ablations, we evaluate fusion placement, mechanism (baseline MAGE+BiTE, CSSA, GAFF), modality subsets, and backbone capacity. Tri-modal fusion improves over all dual-modal baselines, with fusion depth having a significant effect and a lightweight CSSA variant recovering most of the benefit at minimal cost. This work provides the first systematic benchmark and modular backbone for tri-modal UAV-based object detection.
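The bidirectional token-level attention attributed to BiTE can be sketched as two symmetric cross-attention passes between token streams, each with a residual connection. This is a simplified single-head NumPy version under stated assumptions: the function names and the omission of the depthwise-pointwise refinement step are mine, not the paper's.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, keys_values):
    """One direction of exchange: tokens of one modality attend to
    another's. Shapes: (N, D) and (M, D). Single head, no projections."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)  # (N, M)
    return softmax(scores, axis=-1) @ keys_values  # (N, D)

def bidirectional_token_exchange(tok_a, tok_b):
    """Both streams query each other, with residual paths; a loose
    sketch of the described BiTE mechanism (details assumed)."""
    return (tok_a + cross_attend(tok_a, tok_b),
            tok_b + cross_attend(tok_b, tok_a))

rng = np.random.default_rng(0)
tok_rgb = rng.standard_normal((256, 64))    # 16x16 grid of 64-dim tokens
tok_event = rng.standard_normal((256, 64))  # event-stream tokens

out_rgb, out_event = bidirectional_token_exchange(tok_rgb, tok_event)
print(out_rgb.shape, out_event.shape)  # (256, 64) (256, 64)
```

Because both directions run at every selected encoder depth, each stream is refined by the other before the fused maps reach the detector head.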