AI Navigate

VLM-AutoDrive: Post-Training Vision-Language Models for Safety-Critical Autonomous Driving Events

arXiv cs.CV / 3/20/2026


Key Points

  • The paper introduces VLM-AutoDrive, a modular post-training framework that adapts pretrained Vision-Language Models to high-fidelity anomaly detection for safety-critical autonomous driving events.
  • It uses metadata-derived captions, LLM-generated descriptions, VQA pairs, and chain-of-thought supervision to enable domain-aligned, interpretable learning.
  • On real Nexar dashcam videos, fine-tuning with VLM-AutoDrive raises Collision F1 from 0.00 to 0.69 and overall accuracy from 35.35% to 77.27%.
  • The approach provides a scalable recipe for bridging perception, causality, and decision making in autonomous driving, with interpretable reasoning traces.
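The headline numbers above are standard classification metrics (per-class F1 and overall accuracy). As a quick reminder of how they are computed, here is a minimal sketch; the labels and predictions are purely illustrative, not data from the paper:

```python
def f1_and_accuracy(y_true, y_pred, positive):
    """Per-class F1 for the `positive` label plus overall accuracy."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return f1, accuracy

# Illustrative ground truth and predictions only -- not the paper's data.
truth = ["Collision", "Normal", "Near-Collision", "Collision", "Normal"]
pred  = ["Collision", "Normal", "Normal",         "Normal",    "Normal"]
f1, acc = f1_and_accuracy(truth, pred, positive="Collision")
```

A zero-shot model that never predicts "Collision" has tp = 0, hence recall 0 and Collision F1 = 0.00, which is exactly the failure mode the fine-tuning addresses.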

Abstract

The rapid growth of ego-centric dashcam footage presents a major challenge for detecting safety-critical events such as collisions and near-collisions, scenarios that are brief, rare, and difficult for generic vision models to capture. While multimodal large language models (MLLMs) demonstrate strong general reasoning ability, they underperform in driving contexts due to domain and temporal misalignment. We introduce VLM-AutoDrive, a modular post-training framework for adapting pretrained Vision-Language Models (VLMs) to high-fidelity anomaly detection. The framework integrates metadata-derived captions, LLM-generated descriptions, visual question answering (VQA) pairs, and chain-of-thought (CoT) reasoning supervision to enable domain-aligned and interpretable learning. Off-the-shelf VLMs such as NVIDIA's Cosmos-Reason1 7B (CR1) exhibit near-zero Collision recall in zero-shot settings; fine-tuning with VLM-AutoDrive improves Collision F1 from 0.00 to 0.69 and overall accuracy from 35.35% to 77.27%. VLM-AutoDrive offers a scalable recipe for adapting general-purpose VLMs to safety-critical, temporally localized perception tasks. Evaluated on real-world Nexar dashcam videos, it achieves substantial gains in Collision and Near-Collision detection while producing interpretable reasoning traces, bridging the gap between perception, causality, and decision reasoning in autonomous driving.
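The abstract names four supervision signals: metadata-derived captions, LLM-generated descriptions, VQA pairs, and chain-of-thought traces. As a rough illustration of how these could be composed into a single fine-tuning example, here is a sketch; the paper does not publish its record schema, so every field name and string below is a hypothetical assumption:

```python
import json

# Hypothetical record layout combining the four supervision signals
# described in the abstract; this is an illustrative assumption, not
# the paper's actual data format.
record = {
    "video": "clip_00042.mp4",  # dashcam clip identifier (made up)
    "caption": "Ego vehicle approaches an intersection at dusk.",  # metadata-derived caption
    "description": "A sedan ahead brakes sharply while the ego vehicle closes the gap.",  # LLM-generated description
    "vqa": [
        {
            "question": "Does the ego vehicle keep a safe following distance?",
            "answer": "No, the gap shrinks rapidly before the event.",
        }
    ],
    "cot": (
        "The lead vehicle decelerates, the gap closes, and braking comes late, "
        "so the clip shows a near-collision."
    ),  # chain-of-thought reasoning supervision
    "label": "Near-Collision",
}

print(json.dumps(record, indent=2))
```

Pairing the discrete label with a CoT trace is what lets the fine-tuned model emit the interpretable reasoning the paper reports alongside its event classification.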