Long-Horizon Manipulation via Trace-Conditioned VLA Planning

arXiv cs.RO · April 24, 2026


Key Points

  • LoHo-Manip is a modular vision-language-action (VLA) framework that extends short-horizon manipulation policies to long-horizon, multi-step instruction following by adding a dedicated task-management VLM.
  • The manager and executor are decoupled: the manager predicts a progress-aware remaining plan at each step using a lightweight language memory (done + remaining) and a visual trace (a 2D keypoint trajectory prompt).
  • The executor VLA is adapted to condition on the rendered trace, converting long-horizon planning into repeated local control by following the trace step-by-step.
  • By re-planning from the updated trace at every step, LoHo-Manip forms an implicit closed loop: failed steps persist in subsequent plans, so continuation and replanning happen automatically, without hand-crafted recovery logic or brittle visual-history buffers.
  • Experiments in simulation and on a real Franka robot show strong improvements in long-horizon success, robustness, and out-of-distribution generalization across multiple embodied planning and manipulation settings.
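The receding-horizon manager-executor loop described in the key points can be sketched roughly as follows. Every name here (`ToyManager`, `ToyExecutor`, `run_episode`, the integer "environment") is a hypothetical stand-in for illustration; the paper's actual models, interfaces, and training are not shown in this summary.

```python
# Toy sketch of a receding-horizon manager-executor loop in the spirit of
# LoHo-Manip. All classes and method names are hypothetical illustrations,
# not the authors' actual API.

class ToyManager:
    """Predicts the progress-aware remaining plan plus a 2D keypoint trace."""
    def __init__(self, subtasks):
        self.subtasks = subtasks

    def plan(self, obs, done):
        # Lightweight language memory: an explicit done + remaining split
        # over the full subtask list.
        remaining = [s for s in self.subtasks if s not in done]
        # Dummy 2D keypoints standing in for the visual trace prompt.
        trace = [(10 * i, 10 * i) for i in range(len(remaining))]
        return remaining, trace


class ToyExecutor:
    """Short-horizon policy that follows the rendered trace one step."""
    def act(self, obs, trace):
        return trace[0] if trace else None  # head toward the next keypoint


def run_episode(manager, executor, max_steps=10):
    obs, done = 0, []
    for _ in range(max_steps):
        remaining, trace = manager.plan(obs, done)
        if not remaining:        # manager reports the task as complete
            break
        executor.act(obs, trace)
        obs += 1                 # stand-in for one environment step
        # A failed subtask would simply stay in `remaining` on the next
        # call to plan() -- this is the implicit closed loop / replanning.
        done.append(remaining[0])
    return done
```

The key structural point is that the manager is re-invoked from the current observation every step, so no hand-crafted recovery branch is needed: an unfinished subtask reappears in the remaining plan until it actually succeeds.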

Abstract

Long-horizon manipulation remains challenging for vision-language-action (VLA) policies: real tasks are multi-step, progress-dependent, and brittle to compounding execution errors. We present LoHo-Manip, a modular framework that scales short-horizon VLA execution to long-horizon instruction following via a dedicated task-management VLM. The manager is decoupled from the executor and is invoked in a receding-horizon manner: given the current observation, it predicts a progress-aware remaining plan that combines (i) a subtask sequence with an explicit done + remaining split as lightweight language memory, and (ii) a visual trace -- a compact 2D keypoint trajectory prompt specifying where to go and what to approach next. The executor VLA is adapted to condition on the rendered trace, thereby turning long-horizon decision-making into repeated local control by following the trace. Crucially, predicting the remaining plan at each step yields an implicit closed loop: failed steps persist in subsequent outputs, and traces update accordingly, enabling automatic continuation and replanning without hand-crafted recovery logic or brittle visual-history buffers. Extensive experiments spanning embodied planning, long-horizon reasoning, trajectory prediction, and end-to-end manipulation in simulation and on a real Franka robot demonstrate strong gains in long-horizon success, robustness, and out-of-distribution generalization. Project page: https://www.liuisabella.com/LoHoManip