DualCoT-VLA: Visual-Linguistic Chain of Thought via Parallel Reasoning for Vision-Language-Action Models

arXiv cs.RO / 3/24/2026


Key Points

  • DualCoT-VLA is a proposed vision-language-action (VLA) reasoning approach designed to improve performance on complex, multi-step robotic tasks that require both logical planning and fine-grained spatial perception.
  • The method uses two complementary chain-of-thought components—one visual CoT for low-level spatial understanding and one linguistic CoT for high-level task planning—rather than relying on an isolated, single-modal reasoning stream.
  • To address inference latency and compounding errors from autoregressive step-by-step decoding, DualCoT-VLA introduces a parallel reasoning mechanism with two sets of learnable query tokens and reformulates reasoning into single-step forward inference.
  • The paper reports state-of-the-art results on the LIBERO and RoboCasa GR1 benchmarks and claims effectiveness on real-world robotic platforms.

Abstract

Vision-Language-Action (VLA) models map visual observations and language instructions directly to robotic actions. While effective for simple tasks, standard VLA models often struggle with complex, multi-step tasks requiring logical planning, as well as precise manipulations demanding fine-grained spatial perception. Recent efforts have incorporated Chain-of-Thought (CoT) reasoning to endow VLA models with a "thinking before acting" capability. However, current CoT-based VLA models face two critical limitations: 1) an inability to simultaneously capture low-level visual details and high-level logical planning due to their reliance on isolated, single-modal CoT; 2) high inference latency with compounding errors caused by step-by-step autoregressive decoding. To address these limitations, we propose DualCoT-VLA, a visual-linguistic CoT method for VLA models with a parallel reasoning mechanism. To achieve comprehensive multi-modal reasoning, our method integrates a visual CoT for low-level spatial understanding and a linguistic CoT for high-level task planning. Furthermore, to overcome the latency bottleneck, we introduce a parallel CoT mechanism that incorporates two sets of learnable query tokens, shifting autoregressive reasoning to single-step forward reasoning. Extensive experiments demonstrate that our DualCoT-VLA achieves state-of-the-art performance on the LIBERO and RoboCasa GR1 benchmarks, as well as on real-world robotic platforms.
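To make the parallel-reasoning idea concrete, here is a minimal sketch of how two sets of learnable query tokens could extract both CoT streams from a shared vision-language context in a single forward pass, rather than decoding reasoning tokens autoregressively. This is an illustrative toy, not the paper's implementation: the dimensions, the single-head attention, and all variable names (`fused`, `visual_q`, `ling_q`) are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, tokens):
    # Single-step cross-attention: every query reads from all fused
    # tokens at once, so no token-by-token autoregressive loop is needed.
    scores = queries @ tokens.T / np.sqrt(tokens.shape[-1])
    return softmax(scores) @ tokens

rng = np.random.default_rng(0)
d = 16                                   # embedding dim (hypothetical)
fused = rng.normal(size=(32, d))         # fused vision-language tokens
visual_q = rng.normal(size=(8, d))       # learnable visual-CoT queries
ling_q = rng.normal(size=(8, d))         # learnable linguistic-CoT queries

# Both CoT streams are produced in parallel with one forward pass each:
# visual_cot captures low-level spatial cues, ling_cot high-level plans.
visual_cot = cross_attend(visual_q, fused)
ling_cot = cross_attend(ling_q, fused)

print(visual_cot.shape, ling_cot.shape)
```

In a full model, `visual_cot` and `ling_cot` would then condition the action head jointly; the key point the sketch captures is that the cost of "reasoning" becomes one attention pass per stream instead of a sequential decode, which is where the claimed latency reduction comes from.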