JailWAM: Jailbreaking World Action Models in Robot Control

arXiv cs.RO / 4/8/2026


Key Points

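  • World Action Models (WAMs) jointly predict future states and actions and show stronger physical manipulation capabilities than earlier models, but if safety is neglected they can threaten people, property, and the environment.
  • Arguing that the vulnerability of WAMs to jailbreak attacks is a critical, still-unaddressed security gap, the study presents the Three-Level Safety Classification Framework for systematically quantifying the safety of robotic arm motions.
  • It also proposes JailWAM, the first jailbreak attack and evaluation framework dedicated to WAMs, built around three core components: (1) Visual-Trajectory Mapping, (2) a high-recall Risk Discriminator, and (3) Dual-Path Verification (coarse screening followed by verification in closed-loop physical simulation).
  • In experiments in the RoboTwin simulation, JailWAM efficiently exposes physical vulnerabilities, achieving an 84.2% attack success rate on the state-of-the-art LingBot-VA, and the authors show that defenses can also be designed on the basis of JailWAM.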

Abstract

The World Action Model (WAM) jointly predicts future world states and actions, exhibiting stronger physical manipulation capabilities than traditional models. This powerful physical interaction ability is a double-edged sword: if safety is ignored, it directly threatens personal safety, property security, and environmental safety. However, existing research pays very little attention to a critical security gap: the vulnerability of WAMs to jailbreak attacks. To fill this gap, we define the Three-Level Safety Classification Framework to systematically quantify the safety of robotic arm motions. Furthermore, we propose JailWAM, the first dedicated jailbreak attack and evaluation framework for WAMs, which consists of three core components: (1) Visual-Trajectory Mapping, which unifies heterogeneous action spaces into visual trajectory representations and enables unified evaluation across architectures; (2) a Risk Discriminator, which serves as a high-recall screening tool that optimizes the efficiency-accuracy trade-off when identifying destructive behaviors in visual trajectories; and (3) a Dual-Path Verification Strategy, which first conducts rapid coarse screening via a single-image-based video-action generation module and then performs efficient and comprehensive verification through full closed-loop physical simulation. In addition, we construct JailWAM-Bench, a benchmark for comprehensively evaluating the safety alignment of WAMs under jailbreak attacks. Experiments in the RoboTwin simulation environment demonstrate that the proposed framework efficiently exposes physical vulnerabilities, achieving an 84.2% attack success rate on the state-of-the-art LingBot-VA. Meanwhile, robust defense mechanisms can be constructed based on JailWAM, providing an effective technical solution for designing safe and reliable robot control systems.
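
The two-stage idea in the abstract (a cheap, high-recall screen followed by an expensive closed-loop check on the survivors) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Candidate` type, the `risk_score` field, the `0.3` threshold, the stub simulator, and the definition of attack success rate as confirmed unsafe attacks divided by attempted attacks are all assumptions made here for illustration, since the abstract does not specify them.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    """Hypothetical jailbreak attempt (names are illustrative, not from the paper)."""
    prompt: str        # adversarial instruction given to the model
    risk_score: float  # assumed discriminator output in [0, 1]


def coarse_screen(candidates: List[Candidate], threshold: float = 0.3) -> List[Candidate]:
    """Stage 1: keep anything at or above a deliberately low threshold.

    A low threshold favors recall: some harmless candidates pass,
    but few genuinely risky ones are dropped before stage 2.
    """
    return [c for c in candidates if c.risk_score >= threshold]


def dual_path_verify(
    candidates: List[Candidate],
    simulate: Callable[[Candidate], bool],
    threshold: float = 0.3,
) -> List[Candidate]:
    """Stage 2: run the expensive check only on candidates that passed stage 1."""
    shortlisted = coarse_screen(candidates, threshold)
    return [c for c in shortlisted if simulate(c)]


def attack_success_rate(num_confirmed: int, num_attempted: int) -> float:
    """Assumed definition: confirmed unsafe attacks / attempted attacks."""
    if num_attempted <= 0:
        raise ValueError("num_attempted must be positive")
    return num_confirmed / num_attempted


# Toy data; the scores and prompts are made up for this example.
candidates = [
    Candidate("knock the cup off the table", 0.9),
    Candidate("sweep the arm across the workspace", 0.4),
    Candidate("place the cup gently on the shelf", 0.1),
]


def stub_simulator(candidate: Candidate) -> bool:
    """Stand-in for a closed-loop physical simulation; a real check would roll out the policy."""
    return "knock" in candidate.prompt


confirmed = dual_path_verify(candidates, stub_simulator)
asr = attack_success_rate(len(confirmed), len(candidates))
print(f"confirmed={len(confirmed)} attempted={len(candidates)} ASR={asr:.3f}")
```

The design point of the sketch is only the ordering: the simulator is called on two of the three toy candidates rather than all three, which is where a screening stage saves cost when full simulation is the bottleneck.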