A Semantic Observer Layer for Autonomous Vehicles: Pre-Deployment Feasibility Study of VLMs for Low-Latency Anomaly Detection

arXiv cs.RO / 4/1/2026

Key Points

  • The paper proposes a “semantic observer layer” for autonomous vehicles that uses a quantized vision-language model (VLM) to detect context-dependent semantic anomalies not captured by pixel-level detectors.
  • The observer runs at 1–2 Hz in parallel with the AV control loop and can trigger fail-safe handoffs when semantic edge cases are identified.
  • Using Nvidia Cosmos-Reason1-7B with NVFP4 quantization and FlashAttention2, the authors report ~500 ms inference time and a ~50x speedup versus an unoptimized FP16 baseline on the same hardware, meeting the low-latency timing budget.
  • Benchmarks across static and video conditions include an analysis of quantization effects, with NF4 showing a major recall collapse (10.6%) that is identified as a key deployment constraint.
  • The study links performance and latency metrics to hazard/safety goals to argue for pre-deployment feasibility of the proposed semantic observer architecture for embodied-AI AV systems.
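The parallel observer pattern described above can be sketched with two threads: a high-rate control loop that publishes perception frames, and a slow observer that polls them and raises a fail-safe flag when an anomaly is detected. This is a minimal illustrative sketch, not the authors' implementation; `vlm_flags_anomaly` is a hypothetical stand-in for the quantized-VLM inference call, and all timings are scaled down for demonstration.

```python
import threading
import time
from queue import Empty, Queue

# Shared flag the observer raises to request a fail-safe handoff.
FAIL_SAFE = threading.Event()

def vlm_flags_anomaly(frame: dict) -> bool:
    """Stand-in for the ~500 ms quantized-VLM inference (scaled down here)."""
    time.sleep(0.01)  # simulated inference latency
    return frame.get("semantic_hazard", False)

def observer_loop(frames: Queue, period_s: float = 0.5) -> None:
    """Slow semantic observer: poll frames, set FAIL_SAFE on an anomaly."""
    while not FAIL_SAFE.is_set():
        try:
            frame = frames.get(timeout=period_s)
        except Empty:
            continue
        if vlm_flags_anomaly(frame):
            FAIL_SAFE.set()  # signal the control loop to hand off

def control_loop(frames: Queue, n_ticks: int = 40) -> str:
    """High-rate control loop: publish frames, check the flag each tick."""
    for t in range(n_ticks):
        # Frame 10 carries a synthetic semantic hazard for this demo.
        frames.put({"tick": t, "semantic_hazard": t == 10})
        if FAIL_SAFE.is_set():
            return "fail_safe_handoff"
        time.sleep(0.02)  # control tick, scaled down
    return "nominal"

frames: Queue = Queue()
threading.Thread(target=observer_loop, args=(frames,), daemon=True).start()
result = control_loop(frames)
print(result)  # the observer flags frame 10 and the control loop hands off
```

The key design point is that the observer never blocks the control loop: the slow VLM runs on its own thread, and the only coupling is a cheap flag check per control tick.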

Abstract

Semantic anomalies, context-dependent hazards that pixel-level detectors cannot reason about, pose a critical safety risk in autonomous driving. We propose a semantic observer layer: a quantized vision-language model (VLM) running at 1–2 Hz alongside the primary AV control loop, monitoring for semantic edge cases and triggering fail-safe handoffs when they are detected. Using Nvidia Cosmos-Reason1-7B with NVFP4 quantization and FlashAttention2, we achieve ~500 ms inference, a ~50x speedup over the unoptimized FP16 baseline (no quantization, standard PyTorch attention) on the same hardware, satisfying the observer timing budget. We benchmark accuracy, latency, and quantization behavior in static and video conditions, identify NF4 recall collapse (10.6%) as a hard deployment constraint, and present a hazard analysis mapping performance metrics to safety goals. The results establish a pre-deployment feasibility case for the semantic observer architecture on embodied-AI AV platforms.
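For context on the quantization variants being compared, a 4-bit NF4 baseline of the kind the paper benchmarks (the variant whose recall collapses) can be loaded with Hugging Face Transformers and bitsandbytes roughly as follows. This is a hedged config sketch, not the authors' setup: the model class and settings are assumptions, and the reported NVFP4 path relies on NVIDIA's separate quantization toolchain, which is not shown here.

```python
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor, BitsAndBytesConfig

MODEL_ID = "nvidia/Cosmos-Reason1-7B"

# NF4 4-bit weight quantization via bitsandbytes -- the configuration
# whose recall collapse (10.6%) the paper flags as a deployment constraint.
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Model class, dtype, and device mapping are illustrative assumptions.
model = AutoModelForVision2Seq.from_pretrained(
    MODEL_ID,
    quantization_config=nf4_config,
    attn_implementation="flash_attention_2",  # requires flash-attn installed
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(MODEL_ID)
```

A deployment would swap this NF4 baseline for the NVFP4-quantized engine to recover the recall the paper reports, since the study treats NF4's recall collapse as disqualifying for the observer role.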