BINDER: Instantly Adaptive Mobile Manipulation with Open-Vocabulary Commands

arXiv cs.RO / 4/15/2026

Key Points

The paper argues that open-vocabulary mobile manipulation systems fail in dynamic settings because they update their world representation only at discrete moments, leaving robots blind between updates.
It proposes BINDER, a dual-process framework that decouples strategic planning (via a multimodal LLM “DRM”) from continuous monitoring (via a VideoLLM “IRM”).
The DRM produces structured 3D scene updates and instructs what the IRM should focus on, while the IRM continuously analyzes video to update memory, correct actions, and trigger replanning.
By coordinating the DRM and IRM bidirectionally, BINDER aims to maintain situational awareness without incurring the cost of overly frequent updates.
Experiments in three real-world environments with dynamically placed objects show substantially higher success and efficiency than state-of-the-art baselines.

Abstract

Open-vocabulary mobile manipulation (OVMM) requires robots to follow language instructions, navigate, and manipulate while updating their world representation under dynamic environmental changes. However, most prior approaches update their world representation only at discrete update points such as navigation targets, waypoints, or the end of an action step, leaving robots blind between updates and causing cascading failures: overlooked objects, late error detection, and delayed replanning. To address this limitation, we propose BINDER (Bridging INstant and DEliberative Reasoning), a dual-process framework that decouples strategic planning from continuous environment monitoring. Specifically, BINDER integrates a Deliberative Response Module (DRM, a multimodal LLM for task planning) with an Instant Response Module (IRM, a VideoLLM for continuous monitoring). The two modules play complementary roles: the DRM performs strategic planning with structured 3D scene updates and guides what the IRM attends to, while the IRM analyzes video streams to update memory, correct ongoing actions, and trigger replanning when necessary. Through this bidirectional coordination, the modules address the trade-off between maintaining awareness and avoiding costly updates, enabling robust adaptation under dynamic conditions. Evaluated in three real-world environments with dynamic object placement, BINDER achieves substantially higher success and efficiency than SoTA baselines, demonstrating its effectiveness for real-world deployment.
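
The bidirectional coordination described in the abstract can be illustrated with a toy control loop. The sketch below is not the authors' implementation: the real DRM is a multimodal LLM and the real IRM is a VideoLLM, whereas here both are replaced by plain Python stand-ins. All names (`Plan`, `Memory`, `run_episode`, `toy_planner`) and the representation of a video frame as a set of detected object labels are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Set


@dataclass
class Plan:
    steps: List[str]      # ordered action names (hypothetical format)
    watch_for: Set[str]   # labels the planner asks the monitor to attend to


@dataclass
class Memory:
    seen: Set[str] = field(default_factory=set)  # task-relevant objects observed so far


def run_episode(
    plan_fn: Callable[[str, Memory], Plan],
    frames: Iterable[Set[str]],
    instruction: str,
) -> dict:
    """Toy dual-process loop: a slow planner plus a per-frame monitor."""
    memory = Memory()
    plan = plan_fn(instruction, memory)  # deliberative step (stand-in for the DRM)
    replans = 0
    for detected in frames:
        # Monitoring step (stand-in for the IRM): only attend to what the planner requested.
        relevant = detected & plan.watch_for
        new = relevant - memory.seen
        if new:
            memory.seen |= new                   # update memory between planning points
            plan = plan_fn(instruction, memory)  # trigger replanning
            replans += 1
    return {"replans": replans, "seen": memory.seen, "plan": plan}


def toy_planner(instruction: str, memory: Memory) -> Plan:
    """Hypothetical planner: search until the mug has been seen, then fetch it."""
    watch = {"mug", "obstacle"}
    if "mug" in memory.seen:
        return Plan(steps=["grasp mug", "deliver mug"], watch_for=watch)
    return Plan(steps=["search kitchen"], watch_for=watch)


# Each frame is the set of labels detected in it (an assumption of this sketch).
frames = [{"chair"}, {"mug", "chair"}, {"mug"}, {"obstacle"}]
result = run_episode(toy_planner, frames, "bring me the mug")
print(result["replans"], sorted(result["seen"]), result["plan"].steps)
```

In this sketch the monitor runs on every frame but invokes the expensive planner only when something the planner asked about appears for the first time, which is one simple way to picture the trade-off between continuous awareness and costly updates.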
