From Instruction to Event: Sound-Triggered Mobile Manipulation

arXiv cs.RO / 4/16/2026


Key Points

  • The paper argues that mobile manipulation research has focused too heavily on an instruction-driven paradigm, which prevents agents from reacting autonomously to dynamic events in the environment.
  • It introduces a new task setting called sound-triggered mobile manipulation, requiring agents to perceive and interact with sound-emitting objects without explicit action-by-action instructions.
  • To enable this, the authors develop Habitat-Echo, a data platform that combines acoustic rendering with physically grounded interaction in a simulated environment.
  • The work proposes a baseline system with a high-level task planner and low-level policy models, designed to detect auditory events and decide on appropriate interactions.
  • Experiments—including a dual-source setup with overlapping acoustic interference—show the agent can identify the primary sound source, interact with it first, and then proceed to manipulate a secondary object, demonstrating robustness.
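The two-level design described above can be sketched in miniature: a high-level planner maps a detected sound event to a subgoal sequence, and low-level policies execute each subgoal in turn. All names and structures here are illustrative assumptions, not the paper's actual API.

```python
# Hypothetical sketch of the baseline's two-level architecture.
# SoundEvent, plan(), and execute() are invented for illustration only.
from dataclasses import dataclass


@dataclass
class SoundEvent:
    label: str        # e.g. "phone_ringing" (assumed event label)
    position: tuple   # assumed estimated 3D location of the emitter


def plan(event: SoundEvent) -> list:
    """High-level planner: map an auditory event to a subgoal sequence."""
    return [f"navigate_to:{event.label}", f"manipulate:{event.label}"]


def execute(subgoal: str) -> str:
    """Stand-in for a low-level policy rollout; returns a status string."""
    skill, target = subgoal.split(":")
    return f"{skill} succeeded on {target}"


event = SoundEvent(label="phone_ringing", position=(1.0, 0.0, 2.5))
log = [execute(goal) for goal in plan(event)]
```

The point of the separation is that the planner reasons about *what* to do from the auditory event alone, with no per-task instruction, while the policies handle *how* to navigate and manipulate.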

Abstract

Current mobile manipulation research predominantly follows an instruction-driven paradigm, where agents rely on predefined textual commands to execute tasks. However, this setting confines agents to a passive role, limiting their autonomy and ability to react to dynamic environmental events. To address these limitations, we introduce sound-triggered mobile manipulation, where agents must actively perceive and interact with sound-emitting objects without explicit action instructions. To support this task, we develop Habitat-Echo, a data platform that integrates acoustic rendering with physical interaction. We further propose a baseline comprising a high-level task planner and low-level policy models to complete these tasks. Extensive experiments show that the proposed baseline empowers agents to actively detect and respond to auditory events, eliminating the need for case-by-case instructions. Notably, in the challenging dual-source scenario, the agent successfully isolates the primary source from overlapping acoustic interference to execute the first interaction, and subsequently proceeds to manipulate the secondary object, confirming the robustness of the baseline.
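The dual-source behavior, interacting with the primary source before the secondary one, implies some ranking of overlapping sources. The abstract does not say how the baseline ranks them; one simple stand-in (an assumption, not the paper's method) is to order sources by RMS energy and treat the loudest as primary.

```python
# Illustrative only: rank overlapping sound sources by RMS energy and
# interact with the loudest ("primary") source first. The paper's actual
# primary-source isolation method is not specified in this summary.
import math


def rms(samples):
    """Root-mean-square energy of a waveform given as a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def interaction_order(sources):
    """sources: dict mapping source name -> samples; loudest-first order."""
    return sorted(sources, key=lambda name: rms(sources[name]), reverse=True)


# Toy waveforms: the alarm clock is the dominant (primary) source.
sources = {
    "alarm_clock": [0.8, -0.7, 0.9, -0.8],
    "faucet_drip": [0.1, -0.1, 0.2, -0.1],
}
order = interaction_order(sources)  # ["alarm_clock", "faucet_drip"]
```

In practice an agent would need source separation or localization before such a ranking is meaningful; this sketch only illustrates the prioritize-then-proceed behavior the experiments report.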