How Well Do Vision-Language Models Understand Sequential Driving Scenes? A Sensitivity Study

arXiv cs.CV / 4/9/2026


Key Points

  • The paper introduces VENUSS, a framework to systematically test how vision-language models (VLMs) handle sequential driving scenes under different input configurations.
  • Using temporal sequences extracted from existing driving-video datasets, VENUSS evaluates 25+ VLMs across 2,600+ scenarios with structured category settings.
  • Results show top VLMs reach only 57% accuracy versus 65% for humans under similar constraints, revealing notable capability gaps.
  • The study finds VLMs perform better at static object detection than at modeling vehicle dynamics and temporal relationships in driving.
  • VENUSS specifically analyzes sensitivity to presentation factors such as image resolution, frame count, temporal intervals, spatial layouts, and input presentation modes, providing baselines for future work.
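The sensitivity analysis over presentation factors described above amounts to sweeping a grid of input configurations and scoring each one. The sketch below is a hypothetical illustration of such a sweep: the factor names mirror those listed in the paper, but the grids and the scoring stub are illustrative assumptions, not the authors' actual protocol.

```python
from itertools import product

# Hypothetical factor grids; the paper's real settings are not specified here.
FACTORS = {
    "resolution": [(448, 448), (896, 896)],
    "frame_count": [1, 4, 8],
    "interval_s": [0.5, 1.0],
}

def evaluate(config, scenarios):
    """Stub scorer: a real harness would query a VLM with frames prepared
    per `config`. Here we just compare stored predictions to answers."""
    correct = sum(1 for s in scenarios if s["answer"] == s["prediction"])
    return correct / len(scenarios)

def sensitivity_sweep(scenarios):
    # One accuracy value per point in the Cartesian product of all factors.
    results = {}
    for combo in product(*FACTORS.values()):
        config = dict(zip(FACTORS.keys(), combo))
        results[tuple(sorted(config.items()))] = evaluate(config, scenarios)
    return results
```

Comparing accuracies across the resulting grid is what exposes which presentation factors a given model is sensitive to.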

Abstract

Vision-Language Models (VLMs) are increasingly proposed for autonomous driving tasks, yet their performance on sequential driving scenes remains poorly characterized, particularly regarding how input configurations affect their capabilities. We introduce VENUSS (VLM Evaluation oN Understanding Sequential Scenes), a framework for systematic sensitivity analysis of VLM performance on sequential driving scenes, establishing baselines for future research. Building upon existing datasets, VENUSS extracts temporal sequences from driving videos and generates structured evaluations across custom categories. Comparing 25+ existing VLMs across 2,600+ scenarios, we show that even the top models achieve only 57% accuracy, falling short of human performance under similar constraints (65%) and exposing significant capability gaps. Our analysis shows that VLMs excel at static object detection but struggle to model vehicle dynamics and temporal relations. VENUSS offers the first systematic sensitivity analysis of VLMs focused on how input image configurations - resolution, frame count, temporal intervals, spatial layouts, and presentation modes - affect performance on sequential driving scenes. Supplementary material is available at https://V3NU55.github.io