ReCAPA: Hierarchical Predictive Correction to Mitigate Cascading Failures

arXiv cs.AI / 4/25/2026


Key Points

  • The paper introduces ReCAPA, a Predictive Alignment and Planning Architecture for vision-language-action (VLA) systems that aims to prevent local step errors from cascading across long multi-step tasks.
  • ReCAPA corrects deviations at three hierarchical levels—actions, subgoals, and trajectories—using prediction/contrast plus semantic alignment modules including a Sinkhorn-based component and a Score-field module.
  • The predictive correction and alignment are integrated into training to update the action generator, helping it keep fine-grained steps aligned with the overall intent.
  • The authors propose two new metrics that quantify how errors propagate and how the system recovers over long-horizon execution.
  • Experiments on embodied-agent benchmarks (VisualAgentBench, MineDojo, AI2-THOR) report competitive results, outperforming strong proprietary and open-source LLM baselines.
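The Sinkhorn-based alignment module mentioned above presumably builds on entropy-regularized optimal transport, where Sinkhorn iterations turn a cost matrix between two sets of embeddings into a soft, doubly stochastic alignment plan. The sketch below is a minimal, hypothetical illustration of that core computation (not the paper's implementation); the action/subgoal embeddings and all parameter choices are assumptions:

```python
import numpy as np

def sinkhorn(cost, eps=0.1, n_iters=200):
    """Entropy-regularized Sinkhorn iterations.

    Alternately rescales rows and columns of the Gibbs kernel
    exp(-cost/eps) so the result approximates a transport plan with
    uniform row and column marginals (a soft alignment matrix)."""
    n, m = cost.shape
    K = np.exp(-cost / eps)              # Gibbs kernel
    r = np.full(n, 1.0 / n)              # uniform row marginal
    c = np.full(m, 1.0 / m)              # uniform column marginal
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):
        u = r / (K @ v)                  # match row marginals
        v = c / (K.T @ u)                # match column marginals
    return np.diag(u) @ K @ np.diag(v)   # transport plan

# Toy example: align 3 hypothetical action embeddings with 3 subgoal
# embeddings via squared Euclidean cost (normalized to avoid underflow).
rng = np.random.default_rng(0)
actions = rng.normal(size=(3, 4))
subgoals = rng.normal(size=(3, 4))
cost = ((actions[:, None, :] - subgoals[None, :, :]) ** 2).sum(-1)
P = sinkhorn(cost / cost.max())
```

High-mass entries of `P` indicate which subgoal each action step is semantically matched to; a training loss can then penalize transport mass that strays from the intended pairing.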

Abstract

Vision-Language-Action systems follow instructions to execute multi-step tasks in multimodal environments. Recent VLA approaches typically rely on post-hoc correction mechanisms or operate under fixed task decompositions and alignment schemes. However, once an intermediate step is mis-specified, local errors propagate through subsequent steps and eventually accumulate into cascading failures. To mitigate this compounding effect, we propose the Predictive Alignment and Planning Architecture (ReCAPA), a framework that uses predictive and contrastive signals to correct deviations at three levels: actions, subgoals, and trajectories. Semantic alignment is enforced at all levels using a Sinkhorn-based module and a Score-field module. The predictive correction and alignment jointly update the action generator during training, enabling it to adjust fine-grained steps to remain aligned with the overall intent. We further introduce two new metrics that quantify error propagation and recovery, capturing how mistakes spread and fade over long-horizon execution. Experiments show that ReCAPA achieves competitive results on embodied-agent benchmarks such as VisualAgentBench, MineDojo, and AI2-THOR, outperforming strong proprietary and open-source Large Language Model baselines.
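The abstract names the propagation and recovery metrics only abstractly. Purely as an illustration of the kind of quantity such metrics capture (this is not the paper's definition, and the per-step error flags are an assumed input), one could summarize a long-horizon error trace by its mean cascade length and its one-step recovery rate:

```python
def cascade_and_recovery(errors):
    """Illustrative (hypothetical) propagation/recovery summary.

    errors: sequence of 0/1 flags, one per step (1 = step deviated).
    Returns (mean length of consecutive-error runs,
             fraction of error steps immediately followed by a
             correct step, i.e. a one-step recovery)."""
    runs, run = [], 0
    for e in errors:
        if e:
            run += 1          # extend the current error cascade
        elif run:
            runs.append(run)  # cascade ended by a correct step
            run = 0
    if run:
        runs.append(run)      # trailing cascade reached end of episode
    mean_run = sum(runs) / len(runs) if runs else 0.0
    recoveries = sum(
        1 for i in range(len(errors) - 1)
        if errors[i] == 1 and errors[i + 1] == 0
    )
    total_err = sum(errors)
    rec_rate = recoveries / total_err if total_err else 1.0
    return mean_run, rec_rate

# Example trace: one two-step cascade, one single-step error.
mean_run, rec_rate = cascade_and_recovery([0, 1, 1, 0, 1, 0, 0])
```

On this trace the cascades have lengths 2 and 1 (mean 1.5), and 2 of the 3 error steps are immediately followed by a correct step (recovery rate 2/3); longer cascades and lower recovery rates would indicate stronger error compounding.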