Thinking in Text and Images: Interleaved Vision–Language Reasoning Traces for Long-Horizon Robot Manipulation
arXiv cs.AI / 5/4/2026
Key Points
- The paper introduces Interleaved Vision–Language Reasoning (IVLR), a framework for long-horizon robot manipulation that alternates textual subgoals with visual keyframes across the full task horizon.
- At inference time, a single multimodal transformer self-generates and caches an explicit reasoning trace from the initial observation and the instruction; the cached trace, together with the instruction and the current observation, then conditions a closed-loop action decoder (a minimal sketch of this flow follows the list below).
- Because existing datasets lack trace annotations, the authors create pseudo-supervision by temporally segmenting demonstrations and captioning each stage with a vision-language model (also sketched after this list).
- Experiments show strong results on long-horizon benchmarks, achieving 95.5% average success on LIBERO, 92.4% on LIBERO-Long, and 59.4% overall success on SimplerEnv-WidowX, with ablations confirming that both text and vision traces are required.
- Stress tests indicate the trace framework degrades moderately under execution perturbations and partial masking, suggesting some robustness to local corruption but limited tolerance for stale or incorrect global plans.
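The inference-time flow in the second key point can be made concrete with a short sketch. Everything below (the class names, the `TraceStep` container, the environment's `reset`/`step` interface, the dummy 7-DoF action) is a hypothetical stand-in for illustration, not the paper's actual code; the point is only that the trace is generated once from the initial observation and instruction, cached, and then reused by the action decoder at every control step.

```python
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class TraceStep:
    subgoal_text: str        # textual subgoal for one stage of the task
    keyframe: np.ndarray     # visual keyframe associated with that stage


class ReasoningModel:
    """Stand-in for the multimodal transformer that self-generates the trace."""

    def generate_trace(self, instruction: str, initial_obs: np.ndarray) -> List[TraceStep]:
        # The real model would autoregressively interleave text and image tokens;
        # this stub just returns a fixed two-stage trace.
        return [
            TraceStep("reach the first object", initial_obs),
            TraceStep("place it at the goal location", initial_obs),
        ]


class ActionDecoder:
    """Stand-in for the closed-loop action head conditioned on the cached trace."""

    def act(self, trace: List[TraceStep], instruction: str, obs: np.ndarray) -> np.ndarray:
        return np.zeros(7)   # dummy 7-DoF action


def run_episode(env, instruction: str, reasoner: ReasoningModel,
                decoder: ActionDecoder, max_steps: int = 200) -> None:
    obs = env.reset()
    # The trace is produced once from the initial observation and instruction,
    # then cached; only the action decoder runs inside the closed control loop.
    trace = reasoner.generate_trace(instruction, obs)
    for _ in range(max_steps):
        action = decoder.act(trace, instruction, obs)
        obs, done = env.step(action)
        if done:
            break
```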
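The pseudo-supervision step in the third key point can be sketched in the same spirit. The segmentation and captioning routines are passed in as callables because the paper's exact boundary-detection and VLM prompting details are not given here, and treating the last frame of each stage as its keyframe is likewise an assumption made purely for illustration.

```python
from typing import Callable, List, Tuple

import numpy as np


def build_pseudo_trace(
    frames: List[np.ndarray],
    segment_demonstration: Callable[[List[np.ndarray]], List[Tuple[int, int]]],
    caption_with_vlm: Callable[[np.ndarray], str],
) -> List[Tuple[str, np.ndarray]]:
    """Turn one unlabeled demonstration into an interleaved (subgoal, keyframe) trace."""
    trace = []
    for start, end in segment_demonstration(frames):   # (start, end) frame indices per stage
        keyframe = frames[end - 1]                      # assumed: last frame of the stage is its keyframe
        subgoal = caption_with_vlm(keyframe)            # VLM caption becomes the textual subgoal
        trace.append((subgoal, keyframe))
    return trace
```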