HELM: Harness-Enhanced Long-horizon Memory for Vision-Language-Action Manipulation

arXiv cs.AI / 4/22/2026


Key Points

  • Vision-Language-Action (VLA) models still struggle with long-horizon manipulation even when they perform well on short-horizon tasks, and simply increasing context length does not fix the issue under reactive execution.
  • The paper identifies three recurring execution-loop problems—memory gap, verification gap, and recovery gap—as the root causes of systematic long-horizon failures.
  • It introduces HELM, a model-agnostic framework that combines an Episodic Memory Module (retrieving CLIP-indexed keyframes), a learned State Verifier to predict action failure before execution, and a Harness Controller that performs rollback and replanning.
  • The learned State Verifier is the key contributor, outperforming rule-based checks and ensemble uncertainty baselines, with performance depending critically on episodic memory access.
  • Experiments show large gains on LIBERO-LONG (task success rises by 23.1 points over OpenVLA, from 58.4% to 81.5%, while extending the context to H=32 provides only a 5.4-point improvement). HELM also improves long-horizon performance on CALVIN, and the authors release LIBERO-Recovery, a protocol for evaluating failure recovery under controlled perturbations.

Abstract

Vision-Language-Action (VLA) models fail systematically on long-horizon manipulation tasks despite strong short-horizon performance. We show that this failure is not resolved by extending context length alone in the current reactive execution setting; instead, it stems from three recurring execution-loop deficiencies: the memory gap, the verification gap, and the recovery gap. We present HELM, a model-agnostic framework that addresses these deficiencies with three components: an Episodic Memory Module (EMM) that retrieves key task history via CLIP-indexed keyframes, a learned State Verifier (SV) that predicts action failure before execution from observation, action, subgoal, and memory-conditioned context, and a Harness Controller (HC) that performs rollback and replanning. The SV is the core learning contribution: it consistently outperforms rule-based feasibility checks and ensemble uncertainty baselines, and its effectiveness depends critically on access to episodic memory. On LIBERO-LONG, HELM improves task success rate by 23.1 percentage points over OpenVLA (58.4% to 81.5%), while extending the context window to H=32 yields only a 5.4-point gain and same-budget LoRA adaptation remains 12.2 points below HELM. HELM also improves long-horizon performance on CALVIN and substantially boosts recovery success under controlled perturbations. Ablations and mechanism analyses isolate the contribution of each component, and we release LIBERO-Recovery as a perturbation-injection protocol for evaluating failure recovery in long-horizon manipulation.
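The execution loop the abstract describes (retrieve episodic context, verify a proposed action before execution, and roll back or replan on predicted failure) can be sketched roughly as below. This is a minimal illustration, not the paper's implementation: every class and function name (`EpisodicMemory`, `state_verifier`, `harness_step`) is hypothetical, the "retrieval" is a recency stand-in for the paper's CLIP-similarity lookup, and the verifier is a toy stub in place of the learned model.

```python
from dataclasses import dataclass, field

@dataclass
class EpisodicMemory:
    """Stand-in for the Episodic Memory Module (EMM). The paper indexes
    keyframes with CLIP embeddings; here we just store frames in order."""
    keyframes: list = field(default_factory=list)

    def add(self, frame):
        self.keyframes.append(frame)

    def retrieve(self, query, k=3):
        # Stand-in for CLIP-similarity retrieval: return the k most recent frames.
        return self.keyframes[-k:]

def state_verifier(obs, action, subgoal, memory_ctx):
    """Toy stub for the learned State Verifier (SV): predicts a failure
    probability from observation, action, subgoal, and memory context."""
    return 0.9 if action == "grasp_without_align" else 0.1

def harness_step(policy, obs, subgoal, memory, threshold=0.5):
    """One Harness Controller (HC) step: propose, verify before execution,
    and replan if the verifier predicts failure."""
    action = policy(obs, subgoal)
    ctx = memory.retrieve(subgoal)
    if state_verifier(obs, action, subgoal, ctx) > threshold:
        # Predicted failure: roll back and ask the policy to replan.
        action = policy(obs, subgoal + " (replan)")
    memory.add(obs)
    return action

# Toy policy whose first proposal the verifier rejects.
policy = lambda obs, goal: ("grasp_without_align" if "replan" not in goal
                            else "align_then_grasp")
memory = EpisodicMemory()
print(harness_step(policy, "frame_0", "pick up the mug", memory))
# → align_then_grasp
```

The point of the sketch is the ordering: verification happens on the proposed action *before* execution, which is what distinguishes this loop from reactive execution with a longer context window.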