V2X-QA: A Comprehensive Reasoning Dataset and Benchmark for Multimodal Large Language Models in Autonomous Driving Across Ego, Infrastructure, and Cooperative Views

arXiv cs.RO / 4/6/2026


Key Points

  • The paper introduces V2X-QA, a real-world multimodal large language model (MLLM) dataset and benchmark for autonomous driving that evaluates models across vehicle-side, infrastructure-side, and cooperative viewpoints rather than only ego-centric scenarios.
  • V2X-QA uses a view-decoupled evaluation protocol with a unified multiple-choice question answering (MCQA) framework, enabling controlled comparisons under vehicle-only, infrastructure-only, and cooperative driving conditions.
  • The benchmark is organized into a twelve-task taxonomy covering perception, prediction, reasoning, and planning, with expert-verified MCQA annotations designed to support fine-grained diagnosis of viewpoint-dependent strengths and weaknesses.
  • Experiments across ten state-of-the-art models show that access to viewpoint information significantly affects performance, that infrastructure-side reasoning improves macroscopic traffic understanding, and that cooperative reasoning remains difficult due to cross-view alignment and evidence integration needs.
  • To address these issues, the authors propose V2X-MoE, a benchmark-aligned baseline featuring explicit view routing and viewpoint-specific LoRA experts, and find that viewpoint specialization improves multi-view reasoning performance.
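The routing-plus-experts idea above can be sketched in a few lines. This is a hedged illustration under stated assumptions, not the paper's implementation: it assumes hard routing keyed on a view tag ("vehicle", "infrastructure", "cooperative"), a shared frozen base weight, and one low-rank LoRA adapter (B @ A) per viewpoint; all names and dimensions are illustrative.

```python
# Minimal sketch (not the paper's code) of explicit view routing with
# viewpoint-specific LoRA experts, in the spirit of V2X-MoE.
import numpy as np

rng = np.random.default_rng(0)
D, R = 8, 2  # hidden size and LoRA rank (illustrative values)

W_base = rng.standard_normal((D, D))  # shared, frozen backbone weight

# One low-rank expert per viewpoint; only A and B would be trained.
experts = {
    view: {"A": rng.standard_normal((R, D)) * 0.01,
           "B": np.zeros((D, R))}  # B starts at zero, so the delta starts at 0
    for view in ("vehicle", "infrastructure", "cooperative")
}

def route(view_tag):
    """Explicit (hard) routing: the view tag selects exactly one expert."""
    return experts[view_tag]

def forward(x, view_tag):
    """Apply the base layer plus the routed view-specific LoRA delta."""
    e = route(view_tag)
    delta = e["B"] @ e["A"]  # rank-R update, shape (D, D)
    return (W_base + delta) @ x

x = rng.standard_normal(D)
y = forward(x, "infrastructure")
```

Because each B is initialized to zero, every expert initially reproduces the frozen backbone exactly; training then lets each viewpoint's adapter specialize without touching the shared weights.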

Abstract

Multimodal large language models (MLLMs) have shown strong potential for autonomous driving, yet existing benchmarks remain largely ego-centric and therefore cannot systematically assess model performance in infrastructure-centric and cooperative driving conditions. In this work, we introduce V2X-QA, a real-world dataset and benchmark for evaluating MLLMs across vehicle-side, infrastructure-side, and cooperative viewpoints. V2X-QA is built around a view-decoupled evaluation protocol that enables controlled comparison under vehicle-only, infrastructure-only, and cooperative driving conditions within a unified multiple-choice question answering (MCQA) framework. The benchmark is organized into a twelve-task taxonomy spanning perception, prediction, reasoning, and planning, and is constructed through expert-verified MCQA annotation to enable fine-grained diagnosis of viewpoint-dependent capabilities. Benchmark results across ten representative state-of-the-art proprietary and open-source models show that viewpoint accessibility substantially affects performance, and that infrastructure-side reasoning supports meaningful macroscopic traffic understanding. Results also indicate that cooperative reasoning remains challenging, since it requires cross-view alignment and evidence integration rather than simply additional visual input. To address these challenges, we introduce V2X-MoE, a benchmark-aligned baseline with explicit view routing and viewpoint-specific LoRA experts. The strong performance of V2X-MoE further suggests that explicit viewpoint specialization is a promising direction for multi-view reasoning in autonomous driving. Overall, V2X-QA provides a foundation for studying multi-perspective reasoning, reliability, and cooperative physical intelligence in connected autonomous driving. The dataset and V2X-MoE resources are publicly available at: https://github.com/junwei0001/V2X-QA.
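The view-decoupled protocol described above amounts to scoring the same MCQA items separately under each input condition, so per-view accuracies are directly comparable. The sketch below is a hypothetical illustration of that scoring step; the record fields and toy data are assumptions, not taken from the V2X-QA release.

```python
# Hedged sketch of view-decoupled MCQA scoring: accuracy is computed
# independently per view condition (vehicle / infrastructure / cooperative).
# Field names ("view", "gold", "pred") and the demo records are illustrative.
from collections import defaultdict

def per_view_accuracy(records):
    """records: iterable of dicts with 'view', 'gold', 'pred' (choice letters)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["view"]] += 1
        hits[r["view"]] += int(r["pred"] == r["gold"])
    return {view: hits[view] / totals[view] for view in totals}

demo = [
    {"view": "vehicle",        "gold": "A", "pred": "A"},
    {"view": "vehicle",        "gold": "B", "pred": "C"},
    {"view": "infrastructure", "gold": "D", "pred": "D"},
    {"view": "cooperative",    "gold": "A", "pred": "B"},
]
acc = per_view_accuracy(demo)
```

Keeping the question set fixed while varying only the visual condition is what lets gaps between the per-view scores be attributed to viewpoint accessibility rather than to differences in question difficulty.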