HELM: Harness-Enhanced Long-horizon Memory for Vision-Language-Action Manipulation
arXiv cs.AI / 4/22/2026
Key Points
- Vision-Language-Action (VLA) models still struggle with long-horizon manipulation even when they perform well on short-horizon tasks, and simply increasing context length does not fix the issue under reactive execution.
- The paper identifies three recurring execution-loop problems—memory gap, verification gap, and recovery gap—as the root causes of systematic long-horizon failures.
- It introduces HELM, a model-agnostic framework that combines an Episodic Memory Module (retrieving CLIP-indexed keyframes), a learned State Verifier to predict action failure before execution, and a Harness Controller that performs rollback and replanning.
- Ablations identify the learned State Verifier as the key contributor: it outperforms rule-based checks and ensemble uncertainty baselines, and its accuracy depends critically on access to episodic memory.
- Experiments show large gains on LIBERO-LONG (task success rises by 23.1 points over OpenVLA, while context extension to H=32 provides only a 5.4-point improvement), plus improvements on CALVIN and a released LIBERO-Recovery protocol for evaluating failure recovery.
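The components above fit together as a verify-before-execute control loop: the policy proposes an action, the State Verifier predicts whether it will fail, and on a predicted failure the Harness Controller rolls back to the last verified keyframe and replans. The sketch below illustrates that loop under stated assumptions; all class names, signatures, and the toy keyframe store are illustrative, not the paper's actual API, and the real system indexes keyframes with CLIP embeddings rather than by step order.

```python
# Hypothetical sketch of a HELM-style harness loop. Names and
# signatures are assumptions for illustration, not the paper's code.

class EpisodicMemory:
    """Keyframe checkpoint store. HELM retrieves CLIP-indexed
    keyframes; here we simply keep verified states in order."""
    def __init__(self):
        self.snapshots = []

    def add(self, state):
        self.snapshots.append(state)

    def last_verified(self):
        return self.snapshots[-1] if self.snapshots else None


def harness_loop(policy, verifier, env, max_steps=20):
    """Execute an action only if the State Verifier predicts success;
    on a predicted failure, restore the last verified keyframe and
    replan (the Harness Controller's role in the paper)."""
    memory = EpisodicMemory()
    state = env.reset()
    memory.add(state)
    for _ in range(max_steps):
        action = policy(state)
        if not verifier(state, action):            # predicted failure
            state = env.restore(memory.last_verified())
            action = policy(state, replan=True)    # replan from rollback
        state, done = env.step(action)
        memory.add(state)                          # checkpoint progress
        if done:
            return True
    return False
```

The design point the paper's ablations make is visible here: the verifier gates execution *before* an action is taken, and rollback is only possible because the episodic memory retains checkpoints to restore.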