HELM: Harness-Enhanced Long-horizon Memory for Vision-Language-Action Manipulation

arXiv cs.AI / 4/22/2026


Key Points

  • Vision-Language-Action (VLA) models still struggle with long-horizon manipulation even when they perform well on short-horizon tasks, and simply increasing context length does not fix the issue under reactive execution.
  • The paper identifies three recurring execution-loop problems—memory gap, verification gap, and recovery gap—as the root causes of systematic long-horizon failures.
  • It introduces HELM, a model-agnostic framework that combines an Episodic Memory Module (retrieving CLIP-indexed keyframes), a learned State Verifier to predict action failure before execution, and a Harness Controller that performs rollback and replanning.
  • The learned State Verifier is the key contributor, outperforming rule-based checks and ensemble uncertainty baselines, with performance depending critically on episodic memory access.
  • Experiments show large gains on LIBERO-LONG (task success rises by 23.1 points over OpenVLA, from 58.4% to 81.5%, while extending the context to H=32 provides only a 5.4-point improvement). HELM also improves long-horizon performance on CALVIN, and the authors release LIBERO-Recovery, a protocol for evaluating failure recovery under controlled perturbations.

Abstract

Vision-Language-Action (VLA) models fail systematically on long-horizon manipulation tasks despite strong short-horizon performance. We show that this failure is not resolved by extending context length alone in the current reactive execution setting; instead, it stems from three recurring execution-loop deficiencies: the memory gap, the verification gap, and the recovery gap. We present HELM, a model-agnostic framework that addresses these deficiencies with three components: an Episodic Memory Module (EMM) that retrieves key task history via CLIP-indexed keyframes, a learned State Verifier (SV) that predicts action failure before execution from observation, action, subgoal, and memory-conditioned context, and a Harness Controller (HC) that performs rollback and replanning. The SV is the core learning contribution: it consistently outperforms rule-based feasibility checks and ensemble uncertainty baselines, and its effectiveness depends critically on access to episodic memory. On LIBERO-LONG, HELM improves task success rate by 23.1 percentage points over OpenVLA (58.4% to 81.5%), while extending the context window to H=32 yields only a 5.4-point gain and same-budget LoRA adaptation remains 12.2 points below HELM. HELM also improves long-horizon performance on CALVIN and substantially boosts recovery success under controlled perturbations. Ablations and mechanism analyses isolate the contribution of each component, and we release LIBERO-Recovery as a perturbation-injection protocol for evaluating failure recovery in long-horizon manipulation.
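The execution loop the abstract describes (retrieve episodic context, verify a proposed action before execution, and roll back or replan on predicted failure) can be sketched roughly as below. This is a minimal illustration, not the paper's implementation: every class and function name (`EpisodicMemory`, `state_verifier`, `harness_step`) is hypothetical, the "retrieval" is a recency stand-in for the paper's CLIP-similarity lookup, and the verifier is a toy stub in place of the learned model.

```python
from dataclasses import dataclass, field

@dataclass
class EpisodicMemory:
    """Stand-in for the Episodic Memory Module (EMM). The paper indexes
    keyframes with CLIP embeddings; here we just store frames in order."""
    keyframes: list = field(default_factory=list)

    def add(self, frame):
        self.keyframes.append(frame)

    def retrieve(self, query, k=3):
        # Stand-in for CLIP-similarity retrieval: return the k most recent frames.
        return self.keyframes[-k:]

def state_verifier(obs, action, subgoal, memory_ctx):
    """Toy stub for the learned State Verifier (SV): predicts a failure
    probability from observation, action, subgoal, and memory context."""
    return 0.9 if action == "grasp_without_align" else 0.1

def harness_step(policy, obs, subgoal, memory, threshold=0.5):
    """One Harness Controller (HC) step: propose, verify before execution,
    and replan if the verifier predicts failure."""
    action = policy(obs, subgoal)
    ctx = memory.retrieve(subgoal)
    if state_verifier(obs, action, subgoal, ctx) > threshold:
        # Predicted failure: roll back and ask the policy to replan.
        action = policy(obs, subgoal + " (replan)")
    memory.add(obs)
    return action

# Toy policy whose first proposal the verifier rejects.
policy = lambda obs, goal: ("grasp_without_align" if "replan" not in goal
                            else "align_then_grasp")
memory = EpisodicMemory()
print(harness_step(policy, "frame_0", "pick up the mug", memory))
# → align_then_grasp
```

The point of the sketch is the ordering: verification happens on the proposed action *before* execution, which is what distinguishes this loop from reactive execution with a longer context window.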