RLDX-1 Technical Report

arXiv cs.RO / 5/6/2026


Key Points

  • The report introduces RLDX-1, a general-purpose robotic policy aimed at improving dexterous manipulation beyond what existing Vision-Language-Action (VLA) models can handle in complex real-world settings.
  • RLDX-1 is built on the Multi-Stream Action Transformer (MSAT), which integrates heterogeneous modalities using modality-specific streams and cross-modal joint self-attention.
  • The work includes system-level design choices such as synthesizing training data for rare manipulation scenarios, using specialized learning procedures for human-like manipulation, and optimizing inference for real-time deployment.
  • Empirical results indicate that RLDX-1 outperforms recent frontier VLAs (e.g., π0.5 and GR00T N1.6) across both simulation and real-world tasks requiring broader functional capabilities.
  • On ALLEX humanoid tasks, RLDX-1 reportedly achieves an 86.8% success rate versus about 40% for π0.5 and GR00T N1.6, demonstrating stronger control of a high-DoF humanoid robot under diverse demands.

Abstract

While Vision-Language-Action models (VLAs) have shown remarkable progress toward human-like generalist robotic policies through the versatile intelligence (i.e., broad scene understanding and language-conditioned generalization) inherited from pre-trained Vision-Language Models, they still struggle with complex real-world tasks requiring broader functional capabilities (e.g., motion awareness, memory-aware decision making, and physical sensing). To address this, we introduce RLDX-1, a general-purpose robotic policy for dexterous manipulation built on the Multi-Stream Action Transformer (MSAT), an architecture that unifies these capabilities by integrating heterogeneous modalities through modality-specific streams with cross-modal joint self-attention. RLDX-1 further combines this architecture with system-level design choices, including synthesizing training data for rare manipulation scenarios, learning procedures specialized for human-like manipulation, and inference optimizations for real-time deployment. Through empirical evaluation, we show that RLDX-1 consistently outperforms recent frontier VLAs (e.g., π0.5 and GR00T N1.6) across both simulation benchmarks and real-world tasks that require broad functional capabilities beyond general versatility. In particular, RLDX-1 shows superiority on ALLEX humanoid tasks, achieving a success rate of 86.8% while π0.5 and GR00T N1.6 achieve around 40%, highlighting its ability to control a high-DoF humanoid robot under diverse functional demands. Together, these results position RLDX-1 as a promising step toward reliable VLAs for complex, contact-rich, and dynamic real-world dexterous manipulation.
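The report does not publish MSAT's internals, but the abstract's description — modality-specific streams feeding a cross-modal joint self-attention — can be sketched minimally. The following NumPy toy is an assumption-laden illustration, not the actual architecture: the modality names, token counts, and dimensions are all hypothetical, and each "stream" is reduced to a single linear projection into a shared width so that one joint attention pass can mix tokens across modalities.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16  # shared embedding width (hypothetical choice)

def stream_encoder(x, d_in):
    """Stand-in for a modality-specific stream: project d_in-dim
    inputs into the shared d_model space."""
    W = rng.standard_normal((d_in, d_model)) / np.sqrt(d_in)
    return x @ W

def joint_self_attention(tokens):
    """Single-head self-attention over the concatenated sequence,
    so every token can attend to tokens from every modality."""
    Wq = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
    Wk = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
    Wv = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = q @ k.T / np.sqrt(d_model)
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)  # row-wise softmax
    return attn @ v

# Hypothetical per-modality token sets: vision patches, language
# tokens, and proprioceptive (joint-state) readings.
vision   = stream_encoder(rng.standard_normal((8, 32)), 32)  # 8 tokens
language = stream_encoder(rng.standard_normal((5, 24)), 24)  # 5 tokens
proprio  = stream_encoder(rng.standard_normal((3, 12)), 12)  # 3 tokens

# Concatenate the streams and fuse them with one joint attention pass.
joint = np.concatenate([vision, language, proprio], axis=0)
fused = joint_self_attention(joint)
print(fused.shape)  # → (16, 16): one fused vector per input token
```

A real VLA stack would of course use learned, multi-layer, multi-head attention with positional and modality embeddings; the point here is only the data flow the abstract names: separate per-modality encoders, then a single attention step in which all modalities jointly attend to one another.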