RoboAlign-R1: Distilled Multimodal Reward Alignment for Robot Video World Models

arXiv cs.RO / 5/6/2026

Key Points

  • RoboAlign-R1 is a framework, announced on arXiv, for improving robot video world models by aligning their training objectives with robot-relevant goals: instruction following, manipulation success, and physical plausibility.
  • The work introduces RobotWorldBench (10,000 annotated video–instruction pairs from four robot data sources) and RoboAlign-Judge, a multimodal teacher judge that provides fine-grained six-dimensional evaluation of generated videos.
  • RoboAlign-R1 uses teacher–student distillation to compress the teacher judge into a lightweight student reward model, enabling efficient reinforcement-learning-based post-training (see the distillation sketch after this list).
  • To address error accumulation in long-horizon autoregressive prediction, it proposes Sliding Window Re-encoding (SWR), a training-free inference method that periodically refreshes the generation context and reduces rollout drift (sketched after the abstract).
  • Reported in-domain results show a 10.1% improvement in the aggregate six-dimension score over the strongest baseline (including 7.5% on Manipulation Accuracy and 4.6% on Instruction Following), while SWR yields a 2.8% SSIM gain and a 9.8% LPIPS reduction with ~1% additional latency.
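
To make the distillation step concrete, here is a minimal sketch of how a teacher judge's six per-dimension scores might be distilled into a lightweight student reward model. The paper does not publish implementation details, so the module name (`StudentRewardModel`), the plain MSE regression objective, and the feature dimensions below are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

NUM_DIMS = 6  # six evaluation dimensions reported by the teacher judge


class StudentRewardModel(nn.Module):
    """Hypothetical lightweight student: takes fused video+instruction
    features and regresses the teacher judge's six per-dimension scores."""

    def __init__(self, feat_dim: int = 512, hidden: int = 256):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(feat_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, NUM_DIMS),
        )

    def forward(self, video_text_feats: torch.Tensor) -> torch.Tensor:
        # video_text_feats: (batch, feat_dim) fused video/instruction features
        return self.head(video_text_feats)


def distillation_step(student, optimizer, feats, teacher_scores):
    """One distillation update: regress the teacher's six-dimensional scores.
    A simple MSE objective is assumed here for illustration."""
    pred = student(feats)  # (batch, NUM_DIMS) student score predictions
    loss = nn.functional.mse_loss(pred, teacher_scores)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    student = StudentRewardModel()
    opt = torch.optim.AdamW(student.parameters(), lr=1e-4)
    feats = torch.randn(8, 512)               # stand-in fused features
    teacher_scores = torch.rand(8, NUM_DIMS)  # stand-in teacher judgments
    print(distillation_step(student, opt, feats, teacher_scores))
```

The point of the distillation, per the paper, is efficiency: a student small enough to score every rollout makes it practical to use the judge's evaluations as the reward signal during RL-based post-training.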

Abstract

Existing robot video world models are typically trained with low-level objectives such as reconstruction and perceptual similarity, which are poorly aligned with the capabilities that matter most for robot decision making, including instruction following, manipulation success, and physical plausibility. They also suffer from error accumulation in long-horizon autoregressive prediction. We present RoboAlign-R1, a framework that combines reward-aligned post-training with stabilized long-horizon inference for robot video world models. We construct RobotWorldBench, a benchmark of 10,000 annotated video-instruction pairs collected from four robot data sources, and train a multimodal teacher judge, RoboAlign-Judge, to provide fine-grained six-dimensional evaluation of generated videos. We then distill the teacher into a lightweight student reward model for efficient reinforcement-learning-based post-training. To reduce long-horizon rollout drift, we further introduce Sliding Window Re-encoding (SWR), a training-free inference strategy that periodically refreshes the generation context. Under our in-domain evaluation protocol, RoboAlign-R1 improves the aggregate six-dimension score by 10.1% over the strongest baseline, including gains of 7.5% on Manipulation Accuracy and 4.6% on Instruction Following; these ranking improvements are further supported by an external VLM-based cross-check and a blinded human study. Meanwhile, SWR improves long-horizon prediction quality with only about 1% additional latency, yielding a 2.8% gain in SSIM and a 9.8% reduction in LPIPS. Together, these results show that reward-aligned post-training and stabilized long-horizon decoding improve task consistency, physical realism, and long-horizon prediction quality in robot video world models.
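
The abstract describes SWR only at a high level: a training-free strategy that periodically refreshes the generation context during autoregressive rollout. Below is a minimal sketch of that idea, assuming a generic `encode`/`decode_next_frame` world-model interface; the window size, refresh period, and function names are illustrative assumptions rather than the paper's specification.

```python
def rollout_with_swr(model, init_frames, horizon, window=8, refresh_every=16):
    """Autoregressive rollout with Sliding Window Re-encoding (SWR).

    Instead of letting latent context drift over a long rollout, every
    `refresh_every` steps the context is rebuilt by re-encoding only the
    most recent `window` generated frames. This is one plausible reading
    of "periodically refreshes the generation context".
    """
    frames = list(init_frames)
    context = model.encode(frames[-window:])  # hypothetical encoder call

    for step in range(horizon):
        # Hypothetical one-step predictor: emits a frame and updated context.
        next_frame, context = model.decode_next_frame(context)
        frames.append(next_frame)

        if (step + 1) % refresh_every == 0:
            # Training-free refresh: discard the rolled-forward latent state
            # and re-encode the latest frames, bounding accumulated error.
            context = model.encode(frames[-window:])

    return frames[len(init_frames):]  # generated frames only
```

Because the refresh amounts to an occasional extra encoder pass amortized over the whole rollout, this reading is at least consistent with the roughly 1% latency overhead the paper reports.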