VERDI: VLM-Embedded Reasoning for Autonomous Driving

arXiv cs.RO / 4/7/2026


Key Points

  • In autonomous driving, decision making under partial observability and real-world complexity is hard; enabling human-like commonsense reasoning from limited information remains an open challenge.
  • Prior trajectory-planning methods that run a fine-tuned VLM at inference time perform well on benchmarks, but the inference cost of 70B-scale models (slow, memory-hungry) and the difficulty of safety decomposition in a monolithic network are barriers to real-world deployment.
  • Instead of running a VLM directly at inference time, the proposed VERDI is a training-time distillation framework that transfers the VLM's reasoning process and commonsense knowledge into the AD stack, aligning that knowledge with each module (perception, prediction, planning) at the intermediate-representation level.
  • In open-loop and closed-loop evaluation, VERDI improves on existing end-to-end methods without embedded reasoning by up to 11% (ℓ2 distance), achieves the best overall driving performance in the closed-loop HugSim simulator with a 10% improvement in Non-Collision Rate, and maintains fast inference.

Abstract

While autonomous driving (AD) stacks struggle with decision making under partial observability and real-world complexity, human drivers are capable of applying commonsense reasoning to make near-optimal decisions with limited information. Recent work has attempted to leverage finetuned Vision-Language Models (VLMs) for trajectory planning at inference time to emulate human behavior. Despite their success in benchmark evaluations, these methods are often impractical to deploy (a 70B parameter VLM inference at merely 8 tokens per second requires more than 160G of memory), and their monolithic network structure prohibits safety decomposition. To bridge this gap, we propose VLM-Embedded Reasoning for autonomous DrIving (VERDI), a training-time framework that distills the reasoning process and commonsense knowledge of VLMs into the AD stack. VERDI augments modular differentiable end-to-end (e2e) AD models by aligning intermediate module outputs at the perception, prediction, and planning stages with text features explaining the driving reasoning process produced by VLMs. By encouraging alignment in latent space, VERDI enables the modular AD stack to internalize structured reasoning, without incurring the inference-time costs of large VLMs. We evaluate VERDI in both open-loop and closed-loop settings. Our method outperforms existing end-to-end approaches without embedded reasoning by up to 11% in ℓ2 distance, and achieves the best overall driving performance in the closed-loop HugSim simulator, including a 10% improvement in Non-Collision Rate, while maintaining fast inference speed.
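The latent-space alignment described in the abstract can be sketched as a per-stage similarity loss between projected module features and VLM text embeddings. This is a minimal illustration under assumed simplifications, not the paper's actual implementation: the function name, the use of plain cosine similarity, and the assumption that both feature sets are already projected to a shared dimension are all hypothetical.

```python
import numpy as np

def alignment_loss(module_feats, text_feats):
    """Mean (1 - cosine similarity) over stages.

    module_feats: latent outputs of the AD stack's perception,
        prediction, and planning modules (one 1-D array per stage).
    text_feats: VLM text features describing the reasoning at the
        corresponding stage, assumed projected to the same dimension.
    A training loop would add this term to the driving loss so the
    modules internalize the VLM's reasoning at training time only.
    """
    losses = []
    for m, t in zip(module_feats, text_feats):
        cos = np.dot(m, t) / (np.linalg.norm(m) * np.linalg.norm(t) + 1e-8)
        losses.append(1.0 - cos)
    return float(np.mean(losses))

# Identical features give a loss near zero; misaligned ones are penalized.
f = [np.ones(4), np.ones(4), np.ones(4)]
print(alignment_loss(f, f))  # ≈ 0.0
```

Because the loss is applied only during training, inference runs the modular AD stack alone, which is how the framework avoids the runtime cost of the 70B-scale VLM.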