Vega: Learning to Drive with Natural Language Instructions

arXiv cs.RO / 3/27/2026


Key Points

  • Existing vision-language-action models for autonomous driving tend to restrict language to scene description and reasoning; flexibly following diverse user instructions remains an open challenge.
  • The authors construct InstructScene, a large-scale driving dataset of roughly 100,000 scenes annotated with driving instructions and the corresponding trajectories, enabling instruction-based learning.
  • They propose Vega, a unified Vision-Language-World-Action model that processes vision and language autoregressively and uses diffusion to generate future predictions (world modeling) and trajectories (action).
  • Joint attention enables interaction across modalities, while individual projection layers for each modality extend the model's capabilities.
  • Experiments show improved planning performance and strong instruction following, paving the way toward more personalized and intelligent driving systems.

Abstract

Vision-language-action models have reshaped autonomous driving by incorporating language into the decision-making process. However, most existing pipelines only utilize the language modality for scene descriptions or reasoning and lack the flexibility to follow diverse user instructions for personalized driving. To address this, we first construct a large-scale driving dataset (InstructScene) containing around 100,000 scenes annotated with diverse driving instructions and the corresponding trajectories. We then propose a unified Vision-Language-World-Action model, Vega, for instruction-based generation and planning. We employ the autoregressive paradigm to process visual inputs (vision) and language instructions (language) and the diffusion paradigm to generate future predictions (world modeling) and trajectories (action). We perform joint attention to enable interactions between the modalities and use individual projection layers for each modality to expand the model's capabilities. Extensive experiments demonstrate that our method not only achieves superior planning performance but also exhibits strong instruction-following abilities, paving the way for more intelligent and personalized driving systems.
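To make the "joint attention with individual projection layers" idea concrete, here is a minimal NumPy sketch. All dimensions, sequence lengths, and weight initializations are illustrative assumptions, not values from the paper: each modality gets its own Q/K/V projections, and the projected tokens are concatenated so that every token attends over the full multimodal sequence.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # shared hidden size (assumed for illustration)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical per-modality token sequences (lengths are arbitrary).
tokens = {
    "vision":   rng.standard_normal((6, d)),
    "language": rng.standard_normal((4, d)),
    "world":    rng.standard_normal((5, d)),
    "action":   rng.standard_normal((3, d)),
}

# Individual projection layers: each modality has its own Q/K/V weights.
proj = {m: {k: rng.standard_normal((d, d)) / np.sqrt(d) for k in "qkv"}
        for m in tokens}

# Project each modality with its own layers, then concatenate.
Q = np.concatenate([tokens[m] @ proj[m]["q"] for m in tokens])
K = np.concatenate([tokens[m] @ proj[m]["k"] for m in tokens])
V = np.concatenate([tokens[m] @ proj[m]["v"] for m in tokens])

# Joint attention: one attention map over all 6+4+5+3 = 18 tokens.
attn = softmax(Q @ K.T / np.sqrt(d))
out = attn @ V
print(out.shape)  # (18, 8)
```

This is single-head attention without masking or residual connections; the actual model presumably applies modality-appropriate masking (e.g. causal for the autoregressive vision/language stream) and many such layers.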