UAV traffic scene understanding: A cross-spectral guided approach and a unified benchmark

arXiv cs.CV / 3/12/2026

Key Points

  • The paper introduces a Cross-spectral Traffic Cognition Network (CTCNet) designed for robust UAV traffic scene understanding by fusing optical and thermal modalities to handle adverse illumination.
  • It features a Prototype-Guided Knowledge Embedding (PGKE) module that uses external Traffic Regulation Memory (TRM) prototypes to ground visual representations with domain-specific regulatory knowledge for recognizing complex traffic behaviors.
  • It also includes a Quality-Aware Spectral Compensation (QASC) module that enables bidirectional context exchange between optical and thermal streams to mitigate degraded features in challenging environments.
  • The authors release Traffic-VQA, the first large-scale optical-thermal UAV traffic understanding benchmark (8,180 image pairs and 1.3 million QA pairs across 31 types), and report that CTCNet significantly outperforms state-of-the-art methods; the dataset is publicly available on GitHub.
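The paper describes PGKE only at a high level. One plausible reading is a cross-attention lookup in which visual tokens query the Traffic Regulation Memory prototypes and fold the retrieved knowledge back in as a residual. The sketch below illustrates that pattern only; all names, shapes, and the residual design are assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def prototype_guided_embedding(visual_tokens, trm_prototypes):
    """Hypothetical PGKE-style lookup: visual tokens attend over
    regulation-memory prototypes; retrieved knowledge is added back
    as a residual to ground the visual features."""
    d = visual_tokens.shape[-1]
    attn = softmax(visual_tokens @ trm_prototypes.T / np.sqrt(d))
    knowledge = attn @ trm_prototypes      # (N, d) retrieved knowledge
    return visual_tokens + knowledge       # knowledge-grounded tokens

# Toy shapes for illustration (e.g. one prototype per QA type).
rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 64))     # 16 visual tokens, dim 64
protos = rng.standard_normal((31, 64))     # 31 TRM prototypes
out = prototype_guided_embedding(tokens, protos)
print(out.shape)  # → (16, 64)
```

The attention weights form a distribution over prototypes per token, so each token is softly matched against regulatory knowledge rather than hard-classified.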

Abstract

Traffic scene understanding from unmanned aerial vehicle (UAV) platforms is crucial for intelligent transportation systems due to its flexible deployment and wide-area monitoring capabilities. However, existing methods face significant challenges in real-world surveillance, as their heavy reliance on optical imagery leads to severe performance degradation under adverse illumination conditions like nighttime and fog. Furthermore, current Visual Question Answering (VQA) models are restricted to elementary perception tasks, lacking the domain-specific regulatory knowledge required to assess complex traffic behaviors. To address these limitations, we propose a novel Cross-spectral Traffic Cognition Network (CTCNet) for robust UAV traffic scene understanding. Specifically, we design a Prototype-Guided Knowledge Embedding (PGKE) module that leverages high-level semantic prototypes from an external Traffic Regulation Memory (TRM) to anchor domain-specific knowledge into visual representations, enabling the model to comprehend complex behaviors and distinguish fine-grained traffic violations. Moreover, we develop a Quality-Aware Spectral Compensation (QASC) module that exploits the complementary characteristics of optical and thermal modalities to perform bidirectional context exchange, effectively compensating for degraded features to ensure robust representation in complex environments. In addition, we construct Traffic-VQA, the first large-scale optical-thermal infrared benchmark for cognitive UAV traffic understanding, comprising 8,180 aligned image pairs and 1.3 million question-answer pairs across 31 diverse types. Extensive experiments demonstrate that CTCNet significantly outperforms state-of-the-art methods in both cognition and perception scenarios. The dataset is available at https://github.com/YuZhang-2004/UAV-traffic-scene-understanding.
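The QASC module's "bidirectional context exchange" can be pictured as a quality-gated mixing of the two streams: each modality estimates how reliable its own features are and borrows from the other where it is weak (optical at night, for instance). This is a minimal sketch of that idea under assumed shapes and a linear quality estimator; none of it is the paper's actual architecture.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def quality_aware_compensation(opt_feat, thr_feat, w_opt, w_thr):
    """Hypothetical QASC-style exchange: each stream gates itself by a
    per-token quality score in (0, 1) and is compensated by the other
    stream where its own features appear degraded."""
    q_opt = sigmoid(opt_feat @ w_opt)            # (N, 1) optical quality
    q_thr = sigmoid(thr_feat @ w_thr)            # (N, 1) thermal quality
    opt_out = q_opt * opt_feat + (1.0 - q_opt) * thr_feat
    thr_out = q_thr * thr_feat + (1.0 - q_thr) * opt_feat
    return opt_out, thr_out

# Toy run with random features and quality-projection weights.
rng = np.random.default_rng(1)
opt = rng.standard_normal((16, 64))
thr = rng.standard_normal((16, 64))
w_o = rng.standard_normal((64, 1))
w_t = rng.standard_normal((64, 1))
opt_c, thr_c = quality_aware_compensation(opt, thr, w_o, w_t)
print(opt_c.shape, thr_c.shape)  # → (16, 64) (16, 64)
```

Because the gates lie in (0, 1), each compensated feature is a convex combination of the two modalities, so a degraded stream is pulled toward its healthier counterpart rather than overwritten.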