BINDER: Instantly Adaptive Mobile Manipulation with Open-Vocabulary Commands

arXiv cs.RO / 4/15/2026


Key Points

  • The paper argues that open-vocabulary mobile manipulation systems fail in dynamic settings because they update their world representation only at discrete moments, leaving robots blind between updates.
  • It proposes BINDER, a dual-process framework that decouples strategic planning (via a multimodal LLM “DRM”) from continuous monitoring (via a VideoLLM “IRM”).
  • The DRM produces structured 3D scene updates and instructs what the IRM should focus on, while the IRM continuously analyzes video to update memory, correct actions, and trigger replanning.
  • By coordinating the DRM and IRM bidirectionally, BINDER aims to maintain situational awareness without paying the cost of overly frequent world-representation updates.
  • Experiments in three real-world environments with dynamically placed objects show substantially higher success and efficiency than state-of-the-art baselines.
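To make the "blind between updates" argument concrete, here is a small illustrative calculation. It is not from the paper: the function name `detection_delay`, the frame counts, and the 25-frame update interval are all assumptions chosen for illustration. It only shows how long a change can go unnoticed when the world representation is refreshed at discrete points rather than on every frame.

```python
def detection_delay(event_frame, update_frames):
    """Return how many frames pass between an event and the first
    update at or after it, or None if no later update exists.
    (Hypothetical helper for illustration only.)"""
    for frame in sorted(update_frames):
        if frame >= event_frame:
            return frame - event_frame
    return None


# Assumed scenario: an object is moved at frame 30 of a 100-frame episode.
event = 30

# Discrete updates (e.g. only at waypoints), assumed every 25 frames.
discrete_updates = range(0, 101, 25)

# Continuous monitoring, assumed to inspect every frame.
continuous_updates = range(0, 101)

discrete_delay = detection_delay(event, discrete_updates)
continuous_delay = detection_delay(event, continuous_updates)
print(discrete_delay, continuous_delay)
```

Under these assumed numbers the discrete scheme notices the change only at the next update point, while per-frame monitoring notices it immediately; the cost of per-frame monitoring is what the dual-process design is meant to keep manageable.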

Abstract

Open-vocabulary mobile manipulation (OVMM) requires robots to follow language instructions, navigate, and manipulate while updating their world representation under dynamic environmental changes. However, most prior approaches update their world representation only at discrete update points such as navigation targets, waypoints, or the end of an action step, leaving robots blind between updates and causing cascading failures: overlooked objects, late error detection, and delayed replanning. To address this limitation, we propose BINDER (Bridging INstant and DEliberative Reasoning), a dual-process framework that decouples strategic planning from continuous environment monitoring. Specifically, BINDER integrates a Deliberative Response Module (DRM, a multimodal LLM for task planning) with an Instant Response Module (IRM, a VideoLLM for continuous monitoring). The two modules play complementary roles: the DRM performs strategic planning with structured 3D scene updates and guides what the IRM attends to, while the IRM analyzes video streams to update memory, correct ongoing actions, and trigger replanning when necessary. Through this bidirectional coordination, the modules address the trade-off between maintaining awareness and avoiding costly updates, enabling robust adaptation under dynamic conditions. Evaluated in three real-world environments with dynamic object placement, BINDER achieves substantially higher success and efficiency than state-of-the-art (SoTA) baselines, demonstrating its effectiveness for real-world deployment.
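The bidirectional coordination described in the abstract can be sketched as a simple control loop. This is a minimal toy sketch under stated assumptions, not the authors' implementation: the class names `DeliberativeModule` and `InstantModule`, the `run` function, the dictionary-based "frames", and the rule-based planner are all hypothetical stand-ins for the multimodal LLM, the VideoLLM, and real video input. The structured 3D scene representation is reduced here to a name-to-location dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class Memory:
    # Object name -> last known location (toy stand-in for a 3D scene memory).
    objects: dict = field(default_factory=dict)


class DeliberativeModule:
    """Hypothetical stand-in for the DRM: plans and tells the IRM what to watch."""

    def plan(self, target, memory):
        location = memory.objects.get(target)
        if location is None:
            steps = ["search"]
        else:
            steps = [f"navigate:{location}", f"pick:{target}"]
        focus = {target}  # guidance passed from the DRM to the IRM
        return steps, focus


class InstantModule:
    """Hypothetical stand-in for the IRM: checks every frame for relevant changes."""

    def observe(self, frame, focus, memory):
        replan = False
        for name, location in frame.items():
            if name in focus and memory.objects.get(name) != location:
                memory.objects[name] = location  # update memory
                replan = True                    # signal the DRM to replan
        return replan


def run(target, frames):
    memory = Memory()
    drm, irm = DeliberativeModule(), InstantModule()
    steps, focus = drm.plan(target, memory)
    log = []
    for frame in frames:
        # The IRM monitors every frame; the DRM is invoked only when triggered.
        if irm.observe(frame, focus, memory):
            steps, focus = drm.plan(target, memory)
            log.append(("replan", list(steps)))
        if steps:
            log.append(("execute", steps.pop(0)))
    return memory, log


# Assumed episode: the mug is first seen on the table, then moved to the shelf.
frames = [{}, {"mug": "table"}, {"mug": "shelf"}, {}, {}]
memory, log = run("mug", frames)
executed = [step for kind, step in log if kind == "execute"]
print(executed)
```

In this toy episode the expensive planner is called only when the per-frame monitor reports a change to something it was told to watch, which is the trade-off the abstract describes: continuous awareness without replanning at every step.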