Long-Horizon Manipulation via Trace-Conditioned VLA Planning
arXiv cs.RO · April 24, 2026
Key Points
- LoHo-Manip is a modular vision-language-action (VLA) framework that extends short-horizon manipulation policies to long-horizon, multi-step instruction following by adding a dedicated task-management VLM.
- The manager and executor are decoupled: at each step, the manager predicts a progress-aware remaining plan using a lightweight language memory (completed and remaining sub-tasks) and a visual trace (a 2D keypoint-trajectory prompt).
- The executor VLA is adapted to condition on the rendered trace, converting long-horizon planning into repeated local control by following the trace step-by-step.
- Because the manager re-plans and emits an updated trace at every step, LoHo-Manip forms an implicit closed loop: a failed or incomplete sub-task simply remains in the predicted plan, so continuation and recovery happen automatically, without hand-crafted recovery logic or brittle visual-history buffers.
- Experiments in simulation and on a real Franka robot show strong improvements in long-horizon success, robustness, and out-of-distribution generalization across multiple embodied planning and manipulation settings.
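The manager/executor loop described in the key points can be sketched as a toy simulation. Everything below (the function names, the `LanguageMemory` structure, the string-placeholder traces) is an illustrative assumption for exposition, not the paper's actual interface:

```python
# Toy sketch of LoHo-Manip's manager/executor closed loop.
# All names and data types here are hypothetical stand-ins.

from dataclasses import dataclass, field

@dataclass
class LanguageMemory:
    done: list = field(default_factory=list)       # completed sub-tasks
    remaining: list = field(default_factory=list)  # predicted remaining plan

def manager_step(instruction, observation, memory):
    """Stand-in for the task-management VLM: re-predict the remaining plan
    and a visual trace for the next sub-task from the current observation."""
    if not memory.remaining and not memory.done:
        # First call: decompose the instruction into sub-tasks.
        memory.remaining = list(instruction)
    # In the real system this would be a 2D keypoint trajectory rendered
    # onto the image; here it is just a placeholder string.
    next_step = memory.remaining[0] if memory.remaining else None
    return f"trace({next_step})"

def executor_step(observation, trace, succeed):
    """Stand-in for the trace-conditioned executor VLA: follow the rendered
    trace and report whether the sub-task was completed."""
    return succeed(trace)

def run_episode(instruction, succeed, max_steps=20):
    memory = LanguageMemory()
    observation = "frame_0"  # placeholder camera observation
    for _ in range(max_steps):
        trace = manager_step(instruction, observation, memory)
        if not memory.remaining:
            break  # plan exhausted: task complete
        if executor_step(observation, trace, succeed):
            memory.done.append(memory.remaining.pop(0))
        # On failure the sub-task stays in `remaining`, so the next manager
        # call re-plans it automatically -- the implicit closed loop.
    return memory
```

A flaky executor that fails each sub-task's first attempt still finishes the episode, since unfinished sub-tasks persist in the remaining plan and are retried on the next cycle:

```python
attempts = []
def flaky(trace):
    attempts.append(trace)
    return attempts.count(trace) > 1  # fail first attempt, succeed after

mem = run_episode(["pick cube", "place cube"], flaky)
# mem.done == ["pick cube", "place cube"], mem.remaining == []
```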