Swim2Real: VLM-Guided System Identification for Sim-to-Real Transfer

arXiv cs.RO / 3/24/2026

Key Points

  • Swim2Real is a video-to-simulator calibration pipeline that uses vision-language model (VLM) feedback to tune a 16-parameter robotic fish simulator without hand-designed search stages.
  • It targets three hard sim-to-real problems in aquatic robotics (chaotic parameter landscapes, persistent simulator model error, and experiments that are difficult to reproduce) by having the VLM compare simulated and real swim videos and iteratively propose parameter updates.
  • A backtracking line search validates each VLM-proposed step size, raising the acceptance rate from 14% to 42% by recovering proposals whose direction is right but whose magnitude is too large.
  • The calibrated simulator closely matches real fish velocities across motor frequencies (MAE 7.4 mm/s, 43% lower than the next-best method) and maintains robustness with zero outlier seeds across five runs.
  • Motor commands from a policy trained in the tuned simulator transfer to the physical fish at 50 Hz, and downstream RL policies swim 12% farther than policies trained in BayesOpt-calibrated simulators and 90% farther than those trained in CMA-ES-calibrated simulators.
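
The backtracking idea in the key points can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `backtracking_accept`, the shrink factor of 0.5, the cap of four shrinks, the "accept if the loss strictly decreases" rule, and the toy quadratic loss are all assumptions chosen for illustration. In the real pipeline the loss would come from running the simulator and comparing against the real swim video.

```python
import numpy as np


def backtracking_accept(params, proposed_step, loss_fn, shrink=0.5, max_shrinks=4):
    """Try a proposed parameter step; if it does not lower the loss,
    shrink it along the same direction and try again.

    Returns (new_params, new_loss, scale_used, accepted).
    All defaults are illustrative assumptions, not values from the paper.
    """
    params = np.asarray(params, dtype=float)
    step = np.asarray(proposed_step, dtype=float)
    base_loss = loss_fn(params)

    scale = 1.0
    for _ in range(max_shrinks + 1):
        candidate = params + scale * step
        cand_loss = loss_fn(candidate)
        if cand_loss < base_loss:
            # Direction was useful at this magnitude: accept the scaled step.
            return candidate, cand_loss, scale, True
        scale *= shrink  # keep the direction, reduce the magnitude

    # No scaled version of the proposal helped: reject it and keep the old parameters.
    return params, base_loss, 0.0, False


# Toy stand-in for "simulate, then score against the real video" (hypothetical).
target = np.zeros(16)


def toy_loss(p):
    return float(np.sum((p - target) ** 2))


start = np.ones(16)
overshoot = -3.0 * (start - target)  # right direction, magnitude too large
new_params, new_loss, scale, accepted = backtracking_accept(start, overshoot, toy_loss)
# The full step is rejected (loss 64 vs. 16); the halved step is accepted (loss 4).
```

Without the shrink loop, the overshooting proposal above would simply be discarded. With it, the same direction is reused at a smaller magnitude. That is the mechanism the summary credits for the higher acceptance rate.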

Abstract

We present Swim2Real, a pipeline that calibrates a 16-parameter robotic fish simulator from swimming videos using vision-language model (VLM) feedback, requiring no hand-designed search stages. Calibrating soft aquatic robots is particularly challenging because nonlinear fluid-structure coupling makes the parameter landscape chaotic, simplified fluid models introduce a persistent sim-to-real gap, and controlled aquatic experiments are difficult to reproduce. Prior work on this platform required three manually tailored stages to handle this complexity. In Swim2Real, the VLM compares simulated and real videos and proposes parameter updates. A backtracking line search then validates each step size, tripling the acceptance rate from 14% to 42% by recovering proposals where the direction is correct but the magnitude is too large. Swim2Real calibrates all 16 parameters simultaneously, most closely matching real fish velocities across all motor frequencies (MAE = 7.4 mm/s, 43% lower than the next-best method), with zero outlier seeds across five runs. Motor commands from the trained policy transfer to the physical fish at 50 Hz, completing the pipeline from swimming video to real-world deployment. Downstream RL policies swim 12% farther than those from BayesOpt-calibrated simulators and 90% farther than those from CMA-ES-calibrated simulators. These results demonstrate that VLM-guided calibration can close the sim-to-real gap for aquatic robots directly from video, enabling zero-shot RL transfer to physical swimmers without manual system identification, a step toward automated, general-purpose simulator tuning for underwater robotics.
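
The headline metric is a mean absolute error over swim velocities measured at several motor frequencies. The sketch below shows how such a figure and the "43% lower" comparison could be computed. The helper names and the four velocity pairs are made up for illustration and are not data from the paper. The implied baseline assumes the 43% is measured relative to the next-best method's MAE.

```python
import numpy as np


def velocity_mae(sim_mm_s, real_mm_s):
    """Mean absolute error between simulated and real velocities (mm/s),
    with one entry per motor frequency."""
    sim = np.asarray(sim_mm_s, dtype=float)
    real = np.asarray(real_mm_s, dtype=float)
    if sim.shape != real.shape:
        raise ValueError("need one simulated velocity per real velocity")
    return float(np.mean(np.abs(sim - real)))


def relative_reduction(ours, baseline):
    """Fraction by which `ours` is lower than `baseline`."""
    return (baseline - ours) / baseline


# Hypothetical velocities at four motor frequencies (NOT from the paper).
real = [50.0, 80.0, 110.0, 140.0]
sim = [55.0, 76.0, 118.0, 137.0]
toy_mae = velocity_mae(sim, real)  # absolute errors are 5, 4, 8, 3

# Arithmetic on the reported figures: if 7.4 mm/s is 43% lower than the
# next-best MAE, that baseline is about 7.4 / (1 - 0.43), roughly 13 mm/s.
implied_baseline = 7.4 / (1.0 - 0.43)
reduction = relative_reduction(7.4, implied_baseline)
```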