VLMはセマンティック異常検知を解き放てるか？構造化された推論のための枠組み

arXiv cs.RO / 2026/4/9

💬 オピニオンIdeas & Deep AnalysisTools & Practical UsageModels & Research

共有:

要点

本論文は、自動運転システムがまれな分布外（OOD）のセマンティック異常に対して非常に脆弱であり、現在のVLMベースの異常検知は、多くの場合、独自モデルに対する場当たり的なプロンプトに留まっていると主張する。
そこで提案するのがSAVANTであり、モデルに依存しない構造化推論の枠組みである。異常検知を、2段階の手順によって層状のセマンティック整合性検証へ分解する：すなわち、構造化されたシーン記述の抽出と、マルチモーダル評価である。
バランスの取れた実世界の運転シナリオに対する実験により、SAVANTはVLMの異常検知性能を改善し、プロンプトのベースラインに比べて絶対リコールを約18.5%向上させることが示される。
この枠組みを用いて、著者らは高信頼度データセットを生成する。独自の最良モデルによって約10,000枚の画像に自動ラベリングを行い、異常検知におけるデータ不足の問題に対処する。
7Bのオープンソースモデル（Qwen2.5-VL）を単発（single-shot）の異常検知用に微調整し、リコール90.8%、精度93.8%を報告する。これにより、ローカル展開のコストをほぼゼロに近づけられるとしている。

Abstract

Autonomous driving systems remain critically vulnerable to the long-tail of rare, out-of-distribution semantic anomalies. While VLMs have emerged as promising tools for perception, their application in anomaly detection remains largely restricted to prompting proprietary models - limiting reliability, reproducibility, and deployment feasibility. To address this gap, we introduce SAVANT (Semantic Anomaly Verification/Analysis Toolkit), a novel model-agnostic reasoning framework that reformulates anomaly detection as a layered semantic consistency verification. By applying SAVANT's two-phase pipeline - structured scene description extraction and multi-modal evaluation - existing VLMs achieve significantly higher scores in detecting anomalous driving scenarios from input images. Our approach replaces ad hoc prompting with semantic-aware reasoning, transforming VLM-based detection into a principled decomposition across four semantic domains. We show that across a balanced set of real-world driving scenarios, applying SAVANT improves VLM's absolute recall by approximately 18.5% compared to prompting baselines. Moreover, this gain enables reliable large-scale annotation: leveraging the best proprietary model within our framework, we automatically labeled around 10,000 real-world images with high confidence. We use the resulting high-quality dataset to fine-tune a 7B open-source model (Qwen2.5-VL) to perform single-shot anomaly detection, achieving 90.8% recall and 93.8% accuracy - surpassing all models evaluated while enabling local deployment at near-zero cost. By coupling structured semantic reasoning with scalable data curation, SAVANT provides a practical solution to data scarcity in semantic anomaly detection for autonomous systems. Supplementary material: https://SAV4N7.github.io

Black Hat USA

AI Business

Black Hat Asia

AI Business

日立やNEC、フィジカルAIで脱「人月商売」リアルな現場も効率化

日経XTECH

ソフトバンクG、フィジカルAIに名乗り通信がロボにもたらす賢さと速さ

日経XTECH

Xの画像モザイクツールが追加される＆ポスト自動翻訳機能が日本以外でも展開開始＆xAIが10兆パラメーターのAIを開発中

GIGAZINE

VLMはセマンティック異常検知を解き放てるか？構造化された推論のための枠組み

要点

Abstract

関連記事

Black Hat USA

Black Hat Asia

日立やNEC、フィジカルAIで脱「人月商売」リアルな現場も効率化

ソフトバンクG、フィジカルAIに名乗り通信がロボにもたらす賢さと速さ

Xの画像モザイクツールが追加される＆ポスト自動翻訳機能が日本以外でも展開開始＆xAIが10兆パラメーターのAIを開発中

関連おすすめサービス

Notta搭載AI議事録イヤホン ZENCHORD1

AI搭載ボイスレコーダー Plaud

画像高画質化AIツール Aiarty Image Enhancer

要点

Abstract

関連記事

Black Hat USA

Black Hat Asia

日立やNEC、フィジカルAIで脱「人月商売」 リアルな現場も効率化

ソフトバンクG、フィジカルAIに名乗り 通信がロボにもたらす賢さと速さ

Xの画像モザイクツールが追加される＆ポスト自動翻訳機能が日本以外でも展開開始＆xAIが10兆パラメーターのAIを開発中

関連おすすめサービス

Notta搭載AI議事録イヤホン ZENCHORD1

AI搭載ボイスレコーダー Plaud

画像高画質化AIツール Aiarty Image Enhancer

日立やNEC、フィジカルAIで脱「人月商売」リアルな現場も効率化

ソフトバンクG、フィジカルAIに名乗り通信がロボにもたらす賢さと速さ