CrashSight: A Phase-Aware, Infrastructure-Centric Video Benchmark for Traffic Crash Scene Understanding and Reasoning

arXiv cs.RO / 4/10/2026


Key Points

  • CrashSight is a new vision-language benchmark that evaluates how well models understand traffic crash scenes using real-world roadside camera footage rather than ego-vehicle-focused data.
  • The dataset includes 250 crash videos with 13K multiple-choice QA pairs structured in a two-tier taxonomy: Tier 1 tests visual grounding (scene context and parties), while Tier 2 tests higher-level reasoning like crash mechanics, causal attribution, temporal progression, and post-crash outcomes.
  • Benchmarking eight state-of-the-art VLMs shows that, although they can describe scenes well, they often underperform on temporal and causal reasoning in safety-critical crash scenarios.
  • The work provides failure analysis and discusses directions for improving VLM crash understanding for infrastructure-assisted perception in cooperative autonomous driving.
  • The full benchmark dataset and code are released publicly at the project website for standardized evaluation and further research.
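The two-tier multiple-choice structure described above can be pictured as a simple record type plus an accuracy metric. The field names and categories below are illustrative assumptions, not the schema of the actual CrashSight release:

```python
from dataclasses import dataclass

# Hypothetical layout for one multiple-choice QA pair; field names and
# category labels are illustrative, not taken from the CrashSight release.
@dataclass
class CrashQA:
    video_id: str        # roadside camera clip, e.g. "crash_0042"
    tier: int            # 1 = visual grounding, 2 = higher-level reasoning
    category: str        # e.g. "scene_context", "causal_attribution"
    question: str
    choices: list[str]   # multiple-choice options
    answer: int          # index of the correct choice

def accuracy(items: list[CrashQA], predictions: list[int]) -> float:
    """Fraction of questions a model answered correctly."""
    correct = sum(p == qa.answer for qa, p in zip(items, predictions))
    return correct / len(items)

# Tiny usage example with made-up data
qa = CrashQA("crash_0001", 2, "temporal_progression",
             "What happens immediately after the first impact?",
             ["Vehicle A stops", "Vehicle B swerves", "A pedestrian crosses"],
             1)
print(accuracy([qa], [1]))  # → 1.0
```

Per-tier or per-category accuracy (the breakdown the benchmark reports) would follow by filtering `items` on `tier` or `category` before scoring.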

Abstract

Cooperative autonomous driving requires traffic scene understanding from both vehicle and infrastructure perspectives. While vision-language models (VLMs) show strong general reasoning capabilities, their performance in safety-critical traffic scenarios remains insufficiently evaluated due to the ego-vehicle focus of existing benchmarks. To bridge this gap, we present **CrashSight**, a large-scale vision-language benchmark for roadway crash understanding using real-world roadside camera data. The dataset comprises 250 crash videos, annotated with 13K multiple-choice question-answer pairs organized under a two-tier taxonomy. Tier 1 evaluates the visual grounding of scene context and involved parties, while Tier 2 probes higher-level reasoning, including crash mechanics, causal attribution, temporal progression, and post-crash outcomes. We benchmark 8 state-of-the-art VLMs and show that, despite strong scene description capabilities, current models struggle with temporal and causal reasoning in safety-critical scenarios. We provide a detailed analysis of failure scenarios and discuss directions for improving VLM crash understanding. The benchmark provides a standardized evaluation framework for infrastructure-assisted perception in cooperative autonomous driving. The CrashSight benchmark, including the full dataset and code, is accessible at https://mcgrche.github.io/crashsight.