PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation

arXiv cs.RO / 4/7/2026


Key Points

  • PALM is a new vision-language-action (VLA) framework designed to improve long-horizon, multi-step robotic manipulation by adding interaction-centric affordance reasoning and explicit subtask progress tracking.
  • The method distills multiple complementary affordance representations (object relevance, contact geometry, spatial placement, and motion dynamics) to serve as task-relevant anchors for visuomotor control.
  • PALM predicts continuous within-subtask progress to reduce execution failures such as repeated actions, missed steps, and premature termination, enabling smoother transitions between subtasks.
  • Across extensive simulation and real-world benchmarks, PALM outperforms baselines, reaching a 91.8% success rate on LIBERO-LONG, a 12.5% improvement in average task-sequence length on CALVIN (ABC->D), and roughly a 2× gain over real-world baselines in three long-horizon generalization settings.

Abstract

Recent advancements in vision-language-action (VLA) models have shown promise in robotic manipulation, yet they continue to struggle with long-horizon, multi-step tasks. Existing methods lack internal reasoning mechanisms that can identify task-relevant interaction cues or track progress within a subtask, leading to critical execution errors such as repeated actions, missed steps, and premature termination. To address these challenges, we introduce PALM, a VLA framework that structures policy learning around interaction-centric affordance reasoning and subtask progress cues. PALM distills complementary affordance representations that capture object relevance, contact geometry, spatial placements, and motion dynamics, and serve as task-relevant anchors for visuomotor control. To further stabilize long-horizon execution, PALM predicts continuous within-subtask progress, enabling seamless subtask transitions. Across extensive simulation and real-world experiments, PALM consistently outperforms baselines, achieving a 91.8% success rate on LIBERO-LONG, a 12.5% improvement in average length on CALVIN ABC->D, and a 2x improvement over real-world baselines across three long-horizon generalization settings.
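The paper does not include code here, but the progress-tracking idea can be illustrated with a toy sketch: continuous progress labels are derived from frame indices in a demonstrated subtask, and at execution time the predicted progress gates subtask transitions. All names, thresholds, and the hysteresis rule below are our own illustrative assumptions, not PALM's actual implementation.

```python
import numpy as np

def progress_targets(subtask_len):
    # Continuous within-subtask progress labels for a demonstration:
    # frame t maps to t / (T - 1), so progress runs linearly from 0 to 1.
    return np.arange(subtask_len) / (subtask_len - 1)

class SubtaskScheduler:
    """Advance to the next subtask only when predicted progress stays
    above a threshold for `patience` consecutive steps. The hysteresis
    guards against premature termination (advancing on one noisy
    reading) and repeated actions (oscillating between subtasks).
    Threshold/patience values are illustrative assumptions."""

    def __init__(self, num_subtasks, threshold=0.95, patience=3):
        self.num_subtasks = num_subtasks
        self.threshold = threshold
        self.patience = patience
        self.current = 0       # index of the active subtask
        self._streak = 0       # consecutive high-progress readings

    def step(self, predicted_progress):
        if predicted_progress >= self.threshold:
            self._streak += 1
        else:
            self._streak = 0
        if self._streak >= self.patience and self.current < self.num_subtasks - 1:
            self.current += 1  # transition to the next subtask
            self._streak = 0
        return self.current
```

In this sketch, a single spurious high-progress prediction does not trigger a transition; only a sustained run of confident readings does, which mirrors the paper's stated goal of reducing repeated actions, missed steps, and premature termination.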