How Well Do Vision-Language Models Understand Sequential Driving Scenes? A Sensitivity Study
arXiv cs.CV / 4/9/2026
Key Points
- The paper introduces VENUSS, a framework to systematically test how vision-language models (VLMs) handle sequential driving scenes under different input configurations.
- Using temporal sequences extracted from existing driving-video datasets, VENUSS evaluates 25+ VLMs across 2,600+ scenarios with structured category settings.
- Results show top VLMs reach only 57% accuracy versus 65% for humans under similar constraints, revealing notable capability gaps.
- The study finds VLMs perform better at static object detection than at modeling vehicle dynamics and temporal relationships in driving.
- VENUSS specifically analyzes sensitivity to presentation factors such as image resolution, frame count, temporal intervals, spatial layouts, and input presentation modes, providing baselines for future work (a sketch of such a configuration sweep follows below).
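To make the sensitivity-study setup concrete, here is a minimal Python sketch of how an evaluation harness might sweep presentation factors (resolution, frame count, frame interval, layout) for a single driving scenario. This is an illustration only: the paper's actual pipeline is not described here, and `query_vlm`, `PresentationConfig`, and all parameter values are hypothetical.

```python
# Hypothetical presentation-factor sweep in the spirit of VENUSS.
# The model call is a stub; a real harness would resize/tile frames per the
# config and send them, with the question, to the VLM under test.
from dataclasses import dataclass
from itertools import product
from typing import Dict, List, Tuple

@dataclass(frozen=True)
class PresentationConfig:
    resolution: int          # longest image side in pixels (assumed values)
    frame_count: int         # number of frames sampled from the clip
    frame_interval_s: float  # temporal spacing between sampled frames
    layout: str              # "sequential" (one image per frame) or "grid" (tiled montage)

def query_vlm(frames: List[str], question: str, config: PresentationConfig) -> str:
    """Stub standing in for an actual VLM call."""
    return "A"  # placeholder answer

def sweep(frames: List[str], question: str, gold: str) -> Dict[Tuple, bool]:
    """Evaluate one scenario under every combination of presentation factors."""
    results = {}
    for res, n, dt, layout in product([336, 672], [2, 4, 8], [0.5, 1.0],
                                      ["sequential", "grid"]):
        cfg = PresentationConfig(res, n, dt, layout)
        answer = query_vlm(frames[:n], question, cfg)
        results[(res, n, dt, layout)] = (answer == gold)
    return results

if __name__ == "__main__":
    acc = sweep(["frame_000.jpg", "frame_001.jpg"],
                "Is the lead vehicle braking?", gold="A")
    print(f"{sum(acc.values())} of {len(acc)} configurations answered correctly")
```

Aggregating such per-configuration correctness across many scenarios is what lets a study like this attribute accuracy differences to individual presentation factors rather than to the scenarios themselves.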