SARM: Stage-Aware Reward Modeling for Long Horizon Robot Manipulation

arXiv cs.RO / 4/28/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper proposes SARM, a stage-aware, video-based reward modeling framework for long-horizon, contact-rich robot manipulation, addressing inconsistent demonstration quality in tasks like deformable-object handling.
  • SARM jointly predicts the task stage and fine-grained progress using natural-language subtask annotations, producing consistent supervision across variable-length demonstrations and avoiding brittleness from frame-index-based labeling.
  • The reward model is reported to be robust to demonstration variability and to generalize to out-of-distribution settings, leading to improved downstream policy training.
  • The authors further introduce Reward-Aligned Behavior Cloning (RA-BC), which filters and reweights demonstrations using reward estimates, and experiments claim strong gains in real-world rollouts and human validation.
  • For T-shirt folding, the method reportedly achieves 83% success from the flattened state and 67% from the crumpled state, versus 8% and 0% for vanilla behavior cloning, supporting reward modeling as a scalable, annotation-efficient approach for long-horizon robotics.
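The stage-aware labeling idea in the second bullet can be sketched as follows. This is a minimal illustration, not the paper's implementation: given a demonstration's annotated subtask boundaries (the function name `stage_progress_labels` and the boundary format are assumptions), each frame is assigned its stage index plus a progress value normalized within that stage, so demonstrations of different lengths yield consistent per-frame supervision instead of brittle raw frame indices.

```python
def stage_progress_labels(num_frames, stage_boundaries):
    """Derive per-frame (stage, progress) labels from subtask annotations.

    stage_boundaries: list of (start, end) frame ranges, one per annotated
    subtask, with `end` exclusive. Progress is normalized to [0, 1] within
    each stage, so a slow and a fast demonstration of the same subtask
    receive the same label trajectory.
    """
    labels = []
    for t in range(num_frames):
        for stage, (start, end) in enumerate(stage_boundaries):
            if start <= t < end:
                # Normalize progress within the stage; guard 1-frame stages.
                progress = (t - start) / max(end - start - 1, 1)
                labels.append((stage, progress))
                break
    return labels
```

For example, a 4-frame demo with two equal subtasks maps to stages 0 and 1, each sweeping progress from 0 to 1, regardless of how many frames either subtask actually took.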

Abstract

Large-scale robot learning has made progress on complex manipulation tasks, yet long-horizon, contact-rich problems, especially those involving deformable objects, remain challenging due to inconsistent demonstration quality. We propose a stage-aware, video-based reward modeling framework that jointly predicts task stage and fine-grained progress, using natural-language subtask annotations to derive consistent labels across variable-length demonstrations. This avoids the brittleness of frame-index-based labeling and provides stable supervision even in tasks like T-shirt folding. Our reward model is robust to demonstration variability, generalizes to out-of-distribution scenarios, and improves downstream policy training. Building on it, we introduce Reward-Aligned Behavior Cloning (RA-BC), which filters and reweights demonstrations based on reward estimates. Experiments show that our method significantly outperforms baselines in both real-world rollouts and human validation. On T-shirt folding, we achieve 83% success from the flattened state and 67% from the crumpled state, compared to 8% and 0% with vanilla BC. Overall, our results highlight reward modeling as a scalable and annotation-efficient solution for long-horizon robotic manipulation. Project website: https://qianzhong-chen.github.io/sarm.github.io/
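The RA-BC step described in the abstract, filtering and reweighting demonstrations by reward estimates, can be sketched in a few lines. This is an assumed formulation for illustration only: the function name `ra_bc_weights`, the hard threshold, and the softmax-style temperature are hypothetical choices, not the paper's reported scheme.

```python
import numpy as np

def ra_bc_weights(rewards, threshold=0.5, temperature=0.1):
    """Hypothetical reward-aligned filtering/reweighting for BC.

    Demos with estimated reward below `threshold` are dropped (weight 0);
    the survivors get exponential weights, normalized to sum to 1, so
    higher-reward demonstrations dominate the behavior-cloning loss.
    `threshold` and `temperature` are illustrative hyperparameters.
    """
    r = np.asarray(rewards, dtype=float)
    keep = r >= threshold
    w = np.where(keep, np.exp(r / temperature), 0.0)
    total = w.sum()
    return w / total if total > 0 else w
```

In a training loop, these weights would simply scale each demonstration's contribution to the imitation loss, so low-quality demos are excluded and mediocre ones are down-weighted rather than treated equally.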