Collision-Aware Vision-Language Learning for End-to-End Driving with Multimodal Infraction Datasets

arXiv cs.AI / 3/30/2026

Key Points

  • The paper argues that high infraction rates—especially collision-related failures—in closed-loop evaluations are the key bottleneck for end-to-end autonomous driving performance on benchmarks like the CARLA Leaderboard.
  • It proposes VLAAD (Video-Language-Augmented Anomaly Detector), which uses a Multiple Instance Learning (MIL) formulation to produce stable, temporally localized collision signals suitable for proactive prediction and collision-aware representation learning (see the loss sketch after this list).
  • To better train and evaluate collision-aware learning in closed-loop simulation, it introduces CARLA-Collide, a large-scale multimodal simulator dataset covering collision events across diverse road networks rather than limited intersection scenarios.
  • The authors show that VLAAD can function as a plug-in module for existing end-to-end driving systems, reporting a 14.12% relative driving-score improvement when integrated into a pretrained TransFuser++ agent with minimal fine-tuning.
  • For open-loop and real-world generalization, they introduce Real-Collide, a set of dashcam videos with rich semantic annotations, and report that VLAAD, despite having only 0.6B parameters, outperforms a multi-billion-parameter vision-language model, achieving a 23.3% AUC improvement.
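
The key points describe the MIL formulation only at a high level, and the paper's exact objective is not reproduced here. As a rough illustration, the sketch below shows a generic MIL ranking loss of the kind commonly used in weakly supervised video anomaly detection, where clip-level collision labels supervise per-segment scores; the function name, loss weights, and tensor shapes are assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def mil_ranking_loss(pos_scores, neg_scores, w_smooth=8e-5, w_sparse=8e-5):
    # pos_scores: (B, T) per-segment scores for clips that contain a collision
    # neg_scores: (B, T) per-segment scores for normal clips
    pos_max = pos_scores.max(dim=1).values  # most collision-like segment per positive clip
    neg_max = neg_scores.max(dim=1).values  # most collision-like segment per negative clip

    # Hinge ranking: the top segment of a collision clip should outscore
    # the top segment of a normal clip by a margin of 1.
    ranking = F.relu(1.0 - pos_max + neg_max).mean()

    # Temporal smoothness: adjacent segment scores should change gradually,
    # which stabilizes the temporal localization of the collision.
    smooth = ((pos_scores[:, 1:] - pos_scores[:, :-1]) ** 2).sum(dim=1).mean()

    # Sparsity: only the few segments around the impact should be flagged.
    sparse = pos_scores.sum(dim=1).mean()

    return ranking + w_smooth * smooth + w_sparse * sparse
```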

Abstract

High infraction rates remain the primary bottleneck for end-to-end (E2E) autonomous driving, as evidenced by the low driving scores on the CARLA Leaderboard. Despite collision-related infractions being the dominant failure mode in closed-loop evaluations, collision-aware representation learning has received limited attention. To address this gap, we first develop a Video-Language-Augmented Anomaly Detector (VLAAD), leveraging a Multiple Instance Learning (MIL) formulation to obtain stable, temporally localized collision signals for proactive prediction. To transition these capabilities into closed-loop simulations, we must overcome the limitations of existing simulator datasets, which lack multimodality and are frequently restricted to simple intersection scenarios. Therefore, we introduce CARLA-Collide, a large-scale multimodal dataset capturing realistic collision events across highly diverse road networks. Trained on this diverse simulator data, VLAAD serves as a collision-aware plug-in module that can be seamlessly integrated into existing E2E driving models. By integrating our module into a pretrained TransFuser++ agent, we demonstrate a 14.12% relative increase in driving score with minimal fine-tuning. Beyond closed-loop evaluation, we further assess the generalization capability of VLAAD in an open-loop setting using real-world driving data. To support this analysis, we introduce Real-Collide, a multimodal dataset of diverse dashcam videos paired with semantically rich annotations for collision detection and prediction. On this benchmark, despite containing only 0.6B parameters, VLAAD outperforms a multi-billion-parameter vision-language model, achieving a 23.3% improvement in AUC.
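
The abstract does not detail how the collision-aware module is wired into TransFuser++. The snippet below only sketches one common plug-in pattern: projecting a frozen detector's scalar collision-risk score and concatenating it with the planner's fused features before the waypoint head. Every class, method, and dimension here (CollisionAwarePlanner, extract_features, feat_dim, the four-waypoint head) is an illustrative assumption rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class CollisionAwarePlanner(nn.Module):
    """Illustrative plug-in pattern: augment a pretrained end-to-end planner
    with a frozen collision-risk module before the waypoint head."""

    def __init__(self, planner, risk_detector, feat_dim=512, risk_dim=64, n_waypoints=4):
        super().__init__()
        self.planner = planner              # pretrained driving backbone (assumed to expose extract_features)
        self.risk_detector = risk_detector  # collision-aware module, kept frozen during fine-tuning
        self.risk_proj = nn.Linear(1, risk_dim)
        self.head = nn.Linear(feat_dim + risk_dim, 2 * n_waypoints)
        self.n_waypoints = n_waypoints

    def forward(self, camera, lidar, video_clip):
        feats = self.planner.extract_features(camera, lidar)    # (B, feat_dim) fused scene features
        with torch.no_grad():
            risk = self.risk_detector(video_clip)               # (B, 1) scalar collision score
        fused = torch.cat([feats, self.risk_proj(risk)], dim=-1)
        return self.head(fused).view(-1, self.n_waypoints, 2)   # predicted (x, y) waypoints
```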