Can VLMs Unlock Semantic Anomaly Detection? A Framework for Structured Reasoning

arXiv cs.RO / 4/9/2026


Key Points

  • The paper argues that autonomous driving systems are highly vulnerable to rare out-of-distribution semantic anomalies and that current VLM-based anomaly detection is often limited to ad hoc prompting of proprietary models.
  • It introduces SAVANT, a model-agnostic, structured reasoning framework that decomposes anomaly detection into layered semantic consistency verification using two phases: structured scene description extraction and multimodal evaluation.
  • Experiments on a balanced set of real-world driving scenarios show that SAVANT improves VLM anomaly detection, raising absolute recall by about 18.5% over prompting baselines.
  • Using the framework with the best-performing proprietary model, the authors automatically label around 10,000 images to build a high-confidence dataset, addressing data scarcity for anomaly detection.
  • They fine-tune a 7B open-source model (Qwen2.5-VL) for single-shot anomaly detection, reporting 90.8% recall and 93.8% accuracy and enabling near-zero-cost local deployment.
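The two-phase pipeline described in the key points can be sketched as orchestration code. This is an illustrative sketch only: the function names, the four domain labels, and the pass/fail consistency check are assumptions for exposition, not the paper's actual API or domain taxonomy.

```python
# Hypothetical sketch of SAVANT-style layered semantic consistency
# verification. The VLM calls are passed in as callables so the sketch
# stays model-agnostic, mirroring the framework's stated design goal.
from dataclasses import dataclass
from typing import Callable, Dict

# Placeholder names for the four semantic domains; the paper decomposes
# reasoning into four domains, but these labels are assumptions.
DOMAINS = ["objects", "spatial_layout", "agent_behavior", "environment"]

@dataclass
class SceneDescription:
    per_domain: Dict[str, str]  # structured text extracted per domain

def detect_anomaly(
    image: bytes,
    describe: Callable[[bytes, str], str],        # Phase 1: extract a domain-specific description
    evaluate: Callable[[bytes, str, str], bool],  # Phase 2: check image/description consistency
) -> bool:
    """Return True if any semantic domain fails consistency verification."""
    # Phase 1: structured scene description extraction, one pass per domain.
    desc = SceneDescription({d: describe(image, d) for d in DOMAINS})
    # Phase 2: multimodal evaluation; the scene is anomalous if any
    # domain's description is inconsistent with the image.
    return any(not evaluate(image, d, desc.per_domain[d])
               for d in DOMAINS)
```

With stub callables in place of real VLM calls, a scene whose every domain verifies comes back normal, while a single failing domain flags the whole scene as anomalous.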

Abstract

Autonomous driving systems remain critically vulnerable to the long tail of rare, out-of-distribution semantic anomalies. While VLMs have emerged as promising tools for perception, their application in anomaly detection remains largely restricted to prompting proprietary models, limiting reliability, reproducibility, and deployment feasibility. To address this gap, we introduce SAVANT (Semantic Anomaly Verification/Analysis Toolkit), a novel model-agnostic reasoning framework that reformulates anomaly detection as layered semantic consistency verification. By applying SAVANT's two-phase pipeline (structured scene description extraction followed by multimodal evaluation), existing VLMs achieve significantly higher scores in detecting anomalous driving scenarios from input images. Our approach replaces ad hoc prompting with semantic-aware reasoning, transforming VLM-based detection into a principled decomposition across four semantic domains. We show that across a balanced set of real-world driving scenarios, applying SAVANT improves VLMs' absolute recall by approximately 18.5% compared to prompting baselines. Moreover, this gain enables reliable large-scale annotation: leveraging the best proprietary model within our framework, we automatically labeled around 10,000 real-world images with high confidence. We use the resulting high-quality dataset to fine-tune a 7B open-source model (Qwen2.5-VL) to perform single-shot anomaly detection, achieving 90.8% recall and 93.8% accuracy, surpassing all models evaluated while enabling local deployment at near-zero cost. By coupling structured semantic reasoning with scalable data curation, SAVANT provides a practical solution to data scarcity in semantic anomaly detection for autonomous systems. Supplementary material: https://SAV4N7.github.io