Vega: Learning to Drive with Natural Language Instructions

arXiv cs.RO / 3/27/2026


Key Points

  • Existing vision-language-action models for autonomous driving tend to restrict language to scene description and reasoning; flexibly following diverse user instructions remains an open challenge.
  • The authors construct InstructScene, a large-scale driving dataset of roughly 100,000 scenes annotated with driving instructions and the corresponding trajectories, enabling instruction-based learning.
  • They propose Vega, a unified Vision-Language-World-Action model that processes vision and language autoregressively and uses diffusion to generate future predictions (world modeling) and trajectories (action).
  • Joint attention enables interaction across modalities, while individual projection layers for each modality extend the model's capabilities.
  • Experiments show improved planning performance and strong instruction following, paving the way toward more personalized and intelligent driving systems.

Abstract

Vision-language-action models have reshaped autonomous driving by incorporating language into the decision-making process. However, most existing pipelines only utilize the language modality for scene descriptions or reasoning and lack the flexibility to follow diverse user instructions for personalized driving. To address this, we first construct a large-scale driving dataset (InstructScene) containing around 100,000 scenes annotated with diverse driving instructions and the corresponding trajectories. We then propose a unified Vision-Language-World-Action model, Vega, for instruction-based generation and planning. We employ the autoregressive paradigm to process visual inputs (vision) and language instructions (language) and the diffusion paradigm to generate future predictions (world modeling) and trajectories (action). We perform joint attention to enable interactions between the modalities and use individual projection layers for each modality to expand the model's capabilities. Extensive experiments demonstrate that our method not only achieves superior planning performance but also exhibits strong instruction-following abilities, paving the way for more intelligent and personalized driving systems.
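To make the "joint attention with individual projection layers" idea concrete, here is a minimal NumPy sketch. All dimensions, sequence lengths, and weight initializations are illustrative assumptions, not values from the paper: each modality gets its own Q/K/V projections, and the projected tokens are concatenated so that every token attends over the full multimodal sequence.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # shared hidden size (assumed for illustration)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical per-modality token sequences (lengths are arbitrary).
tokens = {
    "vision":   rng.standard_normal((6, d)),
    "language": rng.standard_normal((4, d)),
    "world":    rng.standard_normal((5, d)),
    "action":   rng.standard_normal((3, d)),
}

# Individual projection layers: each modality has its own Q/K/V weights.
proj = {m: {k: rng.standard_normal((d, d)) / np.sqrt(d) for k in "qkv"}
        for m in tokens}

# Project each modality with its own layers, then concatenate.
Q = np.concatenate([tokens[m] @ proj[m]["q"] for m in tokens])
K = np.concatenate([tokens[m] @ proj[m]["k"] for m in tokens])
V = np.concatenate([tokens[m] @ proj[m]["v"] for m in tokens])

# Joint attention: one attention map over all 6+4+5+3 = 18 tokens.
attn = softmax(Q @ K.T / np.sqrt(d))
out = attn @ V
print(out.shape)  # (18, 8)
```

This is single-head attention without masking or residual connections; the actual model presumably applies modality-appropriate masking (e.g. causal for the autoregressive vision/language stream) and many such layers.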