Causal Scene Narration with Runtime Safety Supervision for Vision-Language-Action Driving

arXiv cs.RO / 4/3/2026



Key Points

The paper proposes Causal Scene Narration (CSN), which restructures the text inputs of Vision-Language-Action (VLA) driving models at inference time through intent-constraint alignment, quantitative grounding, and structured separation, so the model is told explicitly which environmental constraints bear on the current maneuver.
CSN itself adds zero GPU cost at inference time. The authors complement it with Simplex-based runtime safety supervision and with training-time alignment via Plackett-Luce DPO with negative log-likelihood (NLL) regularization.
In a multi-town closed-loop CARLA evaluation, CSN improves Driving Score by +31.1% on the original LMDrive and by +24.5% on the preference-aligned variant.
A controlled ablation attributes 39.1% of the improvement to causal structure and the remainder to information content alone; a separate perception-noise ablation shows that the benefit is robust to realistic sensing errors.
Semantic safety supervision improves Infraction Score, whereas reactive Time-To-Collision monitoring degrades performance, which the authors take as evidence that VLA driving systems need intent-aware monitoring.
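
To make the idea of intent-constraint alignment concrete, here is a minimal Python sketch of how disconnected scene fragments might be turned into a structured, quantitative narration. It is an illustration only, not the paper's implementation: the `Hazard` fields, the `RELEVANT_LANES` mapping, the `narrate` function, and the output layout are all hypothetical names and choices made for this example.

```python
from dataclasses import dataclass


@dataclass
class Hazard:
    """A perceived hazard; every field here is hypothetical."""
    kind: str          # e.g. "pedestrian"
    distance_m: float  # distance from the ego vehicle, in meters
    lane: str          # "ego", "left", or "right"


# Assumed mapping from a maneuver to the lanes it can conflict with.
RELEVANT_LANES = {
    "keep_lane": {"ego"},
    "turn_left": {"ego", "left"},
    "turn_right": {"ego", "right"},
}


def narrate(intent: str, hazards: list[Hazard], light: str) -> str:
    """Build a prompt that lists only constraints relevant to the intent."""
    lanes = RELEVANT_LANES.get(intent, {"ego"})
    # Keep hazards that can conflict with the maneuver, nearest first.
    relevant = sorted(
        (h for h in hazards if h.lane in lanes),
        key=lambda h: h.distance_m,
    )
    # Structured separation: intent, traffic state, and constraints
    # each get their own labeled section.
    lines = [f"INTENT: {intent}", f"TRAFFIC LIGHT: {light}", "CONSTRAINTS:"]
    if not relevant:
        lines.append("- none relevant to this maneuver")
    for h in relevant:
        # Quantitative grounding: state the distance as a number.
        lines.append(
            f"- {h.kind} at {h.distance_m:.1f} m in {h.lane} lane "
            f"(conflicts with {intent})"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    scene = [Hazard("pedestrian", 12.34, "left"), Hazard("cyclist", 5.0, "right")]
    print(narrate("turn_left", scene, "green"))
```

Because this is plain string construction on the CPU, it is consistent with the claim that the restructuring needs no GPU work at inference time; the actual rules the authors use to decide relevance are not given in this summary.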

Abstract

Vision-Language-Action (VLA) models for autonomous driving must integrate diverse textual inputs, including navigation commands, hazard warnings, and traffic state descriptions, yet current systems often present these as disconnected fragments, forcing the model to discover on its own which environmental constraints are relevant to the current maneuver. We introduce Causal Scene Narration (CSN), which restructures VLA text inputs through intent-constraint alignment, quantitative grounding, and structured separation, at inference time with zero GPU cost. We complement CSN with Simplex-based runtime safety supervision and training-time alignment via Plackett-Luce DPO with negative log-likelihood (NLL) regularization. A multi-town closed-loop CARLA evaluation shows that CSN improves Driving Score by +31.1% on original LMDrive and +24.5% on the preference-aligned variant. A controlled ablation reveals that causal structure accounts for 39.1% of this gain, with the remainder attributable to information content alone. A perception noise ablation confirms that CSN's benefit is robust to realistic sensing errors. Semantic safety supervision improves Infraction Score, while reactive Time-To-Collision monitoring degrades performance, demonstrating that intent-aware monitoring is needed for VLA systems.
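
The training-time alignment mentioned in the abstract can be sketched numerically. The Python below computes a Plackett-Luce ranking loss over K responses using the implicit DPO reward, beta times the policy-to-reference log-probability ratio, and adds an NLL term. It is a sketch under stated assumptions: the abstract does not say how the NLL regularizer is applied, so applying it to the top-ranked response with a weight `nll_weight` is an assumption, as are the function names and default values.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def pl_dpo_nll_loss(policy_logps, ref_logps, beta=0.1, nll_weight=1.0):
    """Plackett-Luce DPO loss plus an NLL regularizer (assumed form).

    policy_logps, ref_logps: sequence-level log-probabilities of K
    responses, ordered from most to least preferred.
    """
    # Implicit DPO reward for each response.
    rewards = [beta * (p - r) for p, r in zip(policy_logps, ref_logps)]
    # Negative log-likelihood of the observed ranking under Plackett-Luce:
    # at each position k, the chosen item competes with all items ranked
    # at or below it. The last position contributes zero and is skipped.
    pl_loss = 0.0
    for k in range(len(rewards) - 1):
        pl_loss -= rewards[k] - logsumexp(rewards[k:])
    # Assumption: NLL regularization on the top-ranked response only.
    nll = -policy_logps[0]
    return pl_loss + nll_weight * nll


if __name__ == "__main__":
    print(pl_dpo_nll_loss([-1.0, -2.0, -3.0], [-1.5, -1.5, -1.5], beta=1.0))
```

With K = 2 and `nll_weight=0`, the ranking term reduces to the standard pairwise DPO loss, the negative log-sigmoid of the reward difference, which is a convenient sanity check for the sketch.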
