Grounding Hierarchical Vision-Language-Action Models Through Explicit Language-Action Alignment

arXiv cs.RO / 4/8/2026


Key Points

  • The paper addresses robot transparency by requiring that a robot’s natural language communication be explicitly consistent with its visual observations and resulting action trajectories.
  • It introduces a new training framework for hierarchical Vision-Language-Action (VLA) models that performs explicit language–action alignment during training, rather than generating language (e.g., via chain-of-thought) and actions as separate, unaligned outputs.
  • The method uses a contrastive alignment model to rank language–trajectory pairs and applies offline preference learning to refine grounding for each hierarchical sub-task.
  • Experiments on the LanguageTable benchmark (human-language-annotated trajectories) show that the framework achieves strong performance comparable to fully supervised fine-tuning while reducing reliance on costly data annotations.
  • Overall, the work provides insights into multimodal grounding representations and establishes a practical baseline for aligned, transparent robot behaviors.
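The contrastive ranking step described above can be sketched with a minimal example. This is a hypothetical illustration, not the paper's exact formulation: the embedding dimensions, the cosine-similarity scoring, and the InfoNCE-style loss are all assumptions standing in for the paper's contrastive alignment model.

```python
import numpy as np

def alignment_scores(lang_emb, traj_emb):
    # Cosine similarity between every language embedding and every
    # trajectory embedding; row i, column j scores pair (lang_i, traj_j).
    lang = lang_emb / np.linalg.norm(lang_emb, axis=1, keepdims=True)
    traj = traj_emb / np.linalg.norm(traj_emb, axis=1, keepdims=True)
    return lang @ traj.T

def info_nce_loss(scores, temperature=0.1):
    # InfoNCE-style objective: matched (diagonal) language-trajectory
    # pairs should outscore all mismatched pairs in the same row.
    logits = scores / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
n, d = 4, 8  # hypothetical batch size and embedding width
traj = rng.normal(size=(n, d))
lang = traj + 0.05 * rng.normal(size=(n, d))  # simulate well-grounded pairs

scores = alignment_scores(lang, traj)
loss = info_nce_loss(scores)
# When grounding is good, each description ranks its own trajectory first,
# which is exactly the ranking signal used for preference learning.
assert all(scores[i].argmax() == i for i in range(n))
```

Because the similarity matrix yields a full ranking over language-trajectory pairs, not just a binary match label, it can directly supply the preferred-vs-rejected comparisons that an offline preference-learning stage consumes.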

Abstract

Achieving robot transparency is a critical step toward effective human-robot collaboration. To be transparent, a robot's natural language communication must be consistent with its actions and explicitly grounded in the task and environment. Existing hierarchical Vision-Language-Action (VLA) models can generate language (e.g., through chain-of-thought) and low-level actions. However, current work does not consider explicit alignment between these modalities during training. To address this crucial gap, we propose a novel training framework that explicitly grounds hierarchical VLA sub-task descriptions with respect to the visual observation and action space. Our framework uses a contrastive model to assess the alignment between generated language and corresponding action trajectories. This contrastive model enables direct ranking of different language-trajectory pairs based on their alignment, allowing us to refine the grounding of our hierarchical VLA through offline preference learning. We apply our framework to the LanguageTable dataset, a benchmark dataset of human language-annotated trajectories, and provide critical insights into multimodal grounding representations, all while establishing a strong baseline that achieves performance comparable to fully supervised fine-tuning and minimizing the need for costly data annotations.
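The offline preference-learning refinement mentioned in the abstract can be illustrated with a generic pairwise (Bradley-Terry / DPO-style) loss. This is a sketch under assumptions: the scores, the specific logistic loss, and the example values are hypothetical stand-ins for whatever the contrastive alignment model and the paper's actual preference objective produce.

```python
import numpy as np

def preference_loss(score_preferred, score_rejected):
    # Pairwise logistic (Bradley-Terry style) loss: push the
    # better-aligned language-trajectory pair above the worse one.
    # log1p(exp(-m)) is a numerically stable -log(sigmoid(m)).
    margin = np.asarray(score_preferred) - np.asarray(score_rejected)
    return float(np.mean(np.log1p(np.exp(-margin))))

# Hypothetical alignment scores from a contrastive ranker:
preferred = np.array([2.1, 1.7, 2.5])   # well-grounded sub-task descriptions
rejected = np.array([0.3, 0.9, -0.2])   # poorly grounded alternatives

loss = preference_loss(preferred, rejected)
# Swapping the pairs (preferring the misaligned descriptions) yields a
# larger loss, so gradient descent favors grounded language.
assert loss < preference_loss(rejected, preferred)
```

Because the ranking comes from the contrastive model rather than from human labels, this stage can refine grounding offline without the costly annotations the abstract highlights.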