DriveXQA: Cross-modal Visual Question Answering for Adverse Driving Scene Understanding

arXiv cs.CV / 3/13/2026

📰 News · Models & Research

Key Points

  • DriveXQA introduces a new multimodal autonomous driving VQA dataset with four visual modalities, five sensor failure cases, and five weather conditions, totaling 102,505 QA pairs across global, allocentric, and ego-vehicle-centric levels.
  • The work addresses a gap: Multimodal Large Language Models (MLLMs) remain underexplored for integrating multiple sensor modalities to understand adverse driving scenes.
  • The authors propose MVX-LLM, a token-efficient architecture with a Dual Cross-Attention projector to fuse modalities and reduce information redundancy, achieving improved performance under challenging conditions like fog.
  • The dataset and source code will be publicly released, enabling further research and benchmarking in cross-modal perception for autonomous driving.

Abstract

Fusing sensors with complementary modalities is crucial for maintaining a stable and comprehensive understanding of abnormal driving scenes. However, Multimodal Large Language Models (MLLMs) remain underexplored for leveraging multi-sensor information to understand adverse driving scenarios in autonomous vehicles. To address this gap, we propose DriveXQA, a multimodal dataset for autonomous driving VQA. In addition to four visual modalities, five sensor failure cases, and five weather conditions, it includes 102,505 QA pairs categorized into three types: global scene level, allocentric level, and ego-vehicle-centric level. Since no existing MLLM framework adopts multiple complementary visual modalities as input, we design MVX-LLM, a token-efficient architecture with a Dual Cross-Attention (DCA) projector that fuses the modalities to alleviate information redundancy. Experiments demonstrate that our DCA achieves improved performance under challenging conditions such as fog (GPTScore: 53.5 vs. 25.1 for the baseline). The established dataset and source code will be made publicly available.
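The paper does not spell out the internals of the DCA projector, but the general idea of dual cross-attention between two modality token streams can be sketched as below. This is a minimal single-head NumPy illustration, not the authors' implementation: the modality names (`cam`, `lidar`), dimensions, shared projection weights, and mean-pooling fusion are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_tokens, kv_tokens, Wq, Wk, Wv):
    # Single-head cross-attention: queries come from one modality,
    # keys/values from the other, so each stream gathers information
    # from its complement.
    Q = q_tokens @ Wq            # (N_q, d)
    K = kv_tokens @ Wk           # (N_kv, d)
    V = kv_tokens @ Wv           # (N_kv, d)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V   # (N_q, d)

def dual_cross_attention_fuse(cam, lidar, rng):
    # cam: (N_c, d) tokens of one visual modality; lidar: (N_l, d)
    # tokens of another (names are placeholders). Each stream attends
    # to the other, then both attended streams are mean-pooled and
    # averaged into one compact fused vector -- a toy stand-in for the
    # token reduction a projector like DCA aims at.
    d = cam.shape[-1]
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    cam_to_lidar = cross_attention(cam, lidar, Wq, Wk, Wv)   # (N_c, d)
    lidar_to_cam = cross_attention(lidar, cam, Wq, Wk, Wv)   # (N_l, d)
    return 0.5 * (cam_to_lidar.mean(axis=0) + lidar_to_cam.mean(axis=0))  # (d,)
```

Note the token-efficiency angle: instead of concatenating both token sequences (N_c + N_l tokens fed to the LLM), the cross-attended streams are pooled into a far smaller representation, which is the kind of redundancy reduction the abstract attributes to the DCA projector.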