Collision-Aware Vision-Language Learning for End-to-End Driving with Multimodal Infraction Datasets
arXiv cs.AI / 3/30/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The paper argues that high infraction rates—especially collision-related failures—in closed-loop evaluations are the key bottleneck for end-to-end autonomous driving performance on benchmarks like the CARLA Leaderboard.
- It proposes VLAAD (Video-Language-Augmented Anomaly Detector), using a Multiple Instance Learning setup to produce stable, temporally localized collision signals suitable for proactive prediction and collision-aware representation learning.
- To better train and evaluate collision-aware learning in closed-loop simulation, it introduces CARLA-Collide, a large-scale multimodal simulator dataset covering collision events across diverse road networks rather than limited intersection scenarios.
- The authors show that VLAAD can function as a plug-in module for existing end-to-end driving systems, reporting a 14.12% relative driving-score improvement when integrated into a pretrained TransFuser++ agent with minimal fine-tuning.
- For open-loop and real-world generalization, they introduce Real-Collide (dashcam videos with rich semantic annotations) and report that VLAAD—at only 0.6B parameters—achieves a 23.3% AUC improvement and outperforms a much larger vision-language model.
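To make the Multiple Instance Learning idea concrete: in this weakly supervised setup, a whole video clip (the "bag") carries a single collision/no-collision label, while snippet-level labels are unavailable, so the model learns snippet scores that localize the event in time. The sketch below is a minimal illustration of one common MIL objective (top-k mean pooling with binary cross-entropy); the function name, the `k=3` pooling choice, and the toy scores are assumptions for illustration, not the paper's actual loss.

```python
import numpy as np

def mil_bag_loss(snippet_scores, bag_label, k=3):
    """MIL loss for weakly supervised temporal localization.

    A video (bag) is labeled positive if *any* snippet (instance)
    contains a collision, but snippet-level labels are unknown.
    The bag score is the mean of the top-k snippet scores, pushed
    toward the bag label with binary cross-entropy.
    """
    topk = np.sort(snippet_scores)[-k:]        # highest-scoring snippets
    bag_score = float(np.mean(topk))           # bag-level probability
    eps = 1e-7                                 # numerical safety for log
    bag_score = min(max(bag_score, eps), 1 - eps)
    return -(bag_label * np.log(bag_score)
             + (1 - bag_label) * np.log(1 - bag_score))

# Toy example: a collision clip has a few high-scoring snippets;
# a safe clip should score low everywhere.
collision_scores = np.array([0.1, 0.2, 0.9, 0.8, 0.85, 0.1])
safe_scores      = np.array([0.1, 0.05, 0.2, 0.1, 0.15, 0.1])
loss_pos = mil_bag_loss(collision_scores, bag_label=1)  # low loss
loss_neg = mil_bag_loss(safe_scores, bag_label=0)       # low loss
loss_bad = mil_bag_loss(safe_scores, bag_label=1)       # high loss
```

Because only the few highest-scoring snippets drive the gradient, the learned scores tend to peak at the frames containing the event, which is what yields the "stable, temporally localized collision signals" described above.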