Grounding Sim-to-Real Generalization in Dexterous Manipulation: An Empirical Study with Vision-Language-Action Models

arXiv cs.RO / 3/25/2026

Key Points

  • The paper studies how to achieve better Sim-to-Real generalization for dexterous manipulation when using synthetic data instead of expensive real-world collection.
  • It empirically evaluates four candidate determinants of transfer performance (multi-level domain randomization, photorealistic rendering, physics-realistic modeling, and reinforcement learning updates) to identify which ones matter most.
  • The authors introduce a comprehensive evaluation protocol that measures real-world task performance while systematically varying background, lighting, distractors, object types, and spatial features.
  • Experiments across more than 10,000 real-world trials yield actionable insights on which simulation ingredients drive stronger generalist policy transfer, with explicit relevance to Vision-Language-Action (VLA) models.
  • To enable reproducibility and standardized benchmarking, the study releases the robotic platforms and the evaluation protocol for public use.
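
As a rough illustration of the evaluation protocol described in the key points, the sketch below tallies real-world success rates per variation axis. Only the five axis names come from the summary; the `Trial` record, the `success_rate_by_axis` helper, and the sample trials are hypothetical and are not the authors' released protocol or results.

```python
from collections import defaultdict
from dataclasses import dataclass

# Variation axes named in the summary; everything else here is an assumption.
AXES = ("background", "lighting", "distractors", "object_type", "spatial")


@dataclass(frozen=True)
class Trial:
    axis: str      # which variation axis was perturbed in this trial
    success: bool  # whether the real robot completed the task


def success_rate_by_axis(trials):
    """Return {axis: success_rate} over the trials recorded for each axis."""
    counts = defaultdict(lambda: [0, 0])  # axis -> [successes, total]
    for t in trials:
        if t.axis not in AXES:
            raise ValueError(f"unknown axis: {t.axis}")
        counts[t.axis][0] += int(t.success)
        counts[t.axis][1] += 1
    return {axis: s / n for axis, (s, n) in counts.items()}


# Made-up trials for illustration only, not data from the paper.
trials = [
    Trial("lighting", True), Trial("lighting", False),
    Trial("background", True), Trial("background", True),
]
rates = success_rate_by_axis(trials)
```

Reporting one rate per axis, rather than a single pooled number, is what lets a protocol like this show which kind of variation a policy fails under.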

Abstract

Learning a generalist control policy for dexterous manipulation typically relies on large-scale datasets. Given the high cost of real-world data collection, a practical alternative is to generate synthetic data through simulation. However, the resulting synthetic data often exhibits a significant gap from real-world distributions. While many prior studies have proposed algorithms to bridge the Sim-to-Real discrepancy, there remains a lack of principled research that grounds these methods in real-world manipulation tasks, particularly their performance on generalist policies such as Vision-Language-Action (VLA) models. In this study, we empirically examine the primary determinants of Sim-to-Real generalization across four dimensions: multi-level domain randomization, photorealistic rendering, physics-realistic modeling, and reinforcement learning updates. To support this study, we design a comprehensive evaluation protocol to quantify the real-world performance of manipulation tasks. The protocol accounts for key variations in background, lighting, distractors, object types, and spatial features. Through experiments involving over 10k real-world trials, we derive critical insights into Sim-to-Real transfer. To inform and advance future studies, we release both the robotic platforms and the evaluation protocol for public access to facilitate independent verification, thereby establishing a realistic and standardized benchmark for dexterous manipulation policies.
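
To make the idea of domain randomization concrete, here is a minimal sketch of sampling randomized simulation scenes. The abstract does not spell out what the "levels" of multi-level domain randomization are, so the grouping into visual, scene, and physics parameters, the `SceneConfig` fields, and every numeric range below are assumptions chosen for illustration, not the authors' configuration.

```python
import random
from dataclasses import dataclass


@dataclass
class SceneConfig:
    # Visual level (assumed): appearance of the rendered image.
    light_intensity: float
    background_id: int
    # Scene level (assumed): what is on the table and how large it is.
    num_distractors: int
    object_scale: float
    # Physics level (assumed): contact properties of the object.
    friction: float


def sample_scene(rng):
    """Draw one randomized scene; all ranges are illustrative placeholders."""
    return SceneConfig(
        light_intensity=rng.uniform(0.3, 1.5),
        background_id=rng.randrange(20),    # index into a hypothetical texture set
        num_distractors=rng.randint(0, 4),  # inclusive on both ends
        object_scale=rng.uniform(0.9, 1.1),
        friction=rng.uniform(0.5, 1.2),
    )


rng = random.Random(0)  # fixed seed so the sampled scenes are reproducible
scenes = [sample_scene(rng) for _ in range(100)]
```

Each sampled configuration would parameterize one simulated episode, so a policy trained across many of them sees broad variation in appearance and dynamics before it is evaluated on the real robot.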