SARM: Stage-Aware Reward Modeling for Long Horizon Robot Manipulation

arXiv cs.RO / 4/28/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper proposes SARM, a stage-aware, video-based reward modeling framework for long-horizon, contact-rich robot manipulation, addressing inconsistent demonstration quality in tasks like deformable-object handling.
  • SARM jointly predicts the task stage and fine-grained progress using natural-language subtask annotations, producing consistent supervision across variable-length demonstrations and avoiding brittleness from frame-index-based labeling.
  • The reward model is reported to be robust to demonstration variability and to generalize to out-of-distribution settings, leading to improved downstream policy training.
  • The authors further introduce Reward-Aligned Behavior Cloning (RA-BC), which filters and reweights demonstrations using reward estimates, and experiments claim strong gains in real-world rollouts and human validation.
  • For T-shirt folding, the method reportedly achieves 83% success from the flattened state and 67% from the crumpled state, versus 8% and 0% for vanilla behavior cloning, supporting reward modeling as a scalable, annotation-efficient approach for long-horizon robotics.
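The stage-aware labeling idea in the second bullet can be sketched as follows. This is a minimal illustration, not the paper's implementation: given a demonstration's annotated subtask boundaries (the function name `stage_progress_labels` and the boundary format are assumptions), each frame is assigned its stage index plus a progress value normalized within that stage, so demonstrations of different lengths yield consistent per-frame supervision instead of brittle raw frame indices.

```python
def stage_progress_labels(num_frames, stage_boundaries):
    """Derive per-frame (stage, progress) labels from subtask annotations.

    stage_boundaries: list of (start, end) frame ranges, one per annotated
    subtask, with `end` exclusive. Progress is normalized to [0, 1] within
    each stage, so a slow and a fast demonstration of the same subtask
    receive the same label trajectory.
    """
    labels = []
    for t in range(num_frames):
        for stage, (start, end) in enumerate(stage_boundaries):
            if start <= t < end:
                # Normalize progress within the stage; guard 1-frame stages.
                progress = (t - start) / max(end - start - 1, 1)
                labels.append((stage, progress))
                break
    return labels
```

For example, a 4-frame demo with two equal subtasks maps to stages 0 and 1, each sweeping progress from 0 to 1, regardless of how many frames either subtask actually took.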

Abstract

Large-scale robot learning has made progress on complex manipulation tasks, yet long-horizon, contact-rich problems, especially those involving deformable objects, remain challenging due to inconsistent demonstration quality. We propose a stage-aware, video-based reward modeling framework that jointly predicts task stage and fine-grained progress, using natural-language subtask annotations to derive consistent labels across variable-length demonstrations. This avoids the brittleness of frame-index-based labeling and provides stable supervision even in tasks like T-shirt folding. Our reward model is robust to demonstration variability, generalizes to out-of-distribution scenarios, and improves downstream policy training. Building on it, we introduce Reward-Aligned Behavior Cloning (RA-BC), which filters and reweights demonstrations based on reward estimates. Experiments show that our method significantly outperforms baselines in both real-world rollouts and human validation. On T-shirt folding, we achieve 83% success from the flattened state and 67% from the crumpled state, compared to 8% and 0% with vanilla BC. Overall, our results highlight reward modeling as a scalable and annotation-efficient solution for long-horizon robotic manipulation. Project website: https://qianzhong-chen.github.io/sarm.github.io/
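The RA-BC step described in the abstract, filtering and reweighting demonstrations by reward estimates, can be sketched in a few lines. This is an assumed formulation for illustration only: the function name `ra_bc_weights`, the hard threshold, and the softmax-style temperature are hypothetical choices, not the paper's reported scheme.

```python
import numpy as np

def ra_bc_weights(rewards, threshold=0.5, temperature=0.1):
    """Hypothetical reward-aligned filtering/reweighting for BC.

    Demos with estimated reward below `threshold` are dropped (weight 0);
    the survivors get exponential weights, normalized to sum to 1, so
    higher-reward demonstrations dominate the behavior-cloning loss.
    `threshold` and `temperature` are illustrative hyperparameters.
    """
    r = np.asarray(rewards, dtype=float)
    keep = r >= threshold
    w = np.where(keep, np.exp(r / temperature), 0.0)
    total = w.sum()
    return w / total if total > 0 else w
```

In a training loop, these weights would simply scale each demonstration's contribution to the imitation loss, so low-quality demos are excluded and mediocre ones are down-weighted rather than treated equally.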