From Instruction to Event: Sound-Triggered Mobile Manipulation

arXiv cs.RO / 4/16/2026


Key Points

  • The paper argues that mobile manipulation research has focused too heavily on an instruction-driven paradigm, which prevents agents from reacting autonomously to dynamic events in the environment.
  • It introduces a new task setting called sound-triggered mobile manipulation, requiring agents to perceive and interact with sound-emitting objects without explicit action-by-action instructions.
  • To enable this, the authors develop Habitat-Echo, a data platform that combines acoustic rendering with physically grounded interaction in a simulated environment.
  • The work proposes a baseline system with a high-level task planner and low-level policy models, designed to detect auditory events and decide on appropriate interactions.
  • Experiments—including a dual-source setup with overlapping acoustic interference—show the agent can identify the primary sound source, interact with it first, and then proceed to manipulate a secondary object, demonstrating robustness.
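The two-level design described above can be sketched in miniature: a high-level planner maps a detected sound event to a subgoal sequence, and low-level policies execute each subgoal in turn. All names and structures here are illustrative assumptions, not the paper's actual API.

```python
# Hypothetical sketch of the baseline's two-level architecture.
# SoundEvent, plan(), and execute() are invented for illustration only.
from dataclasses import dataclass


@dataclass
class SoundEvent:
    label: str        # e.g. "phone_ringing" (assumed event label)
    position: tuple   # assumed estimated 3D location of the emitter


def plan(event: SoundEvent) -> list:
    """High-level planner: map an auditory event to a subgoal sequence."""
    return [f"navigate_to:{event.label}", f"manipulate:{event.label}"]


def execute(subgoal: str) -> str:
    """Stand-in for a low-level policy rollout; returns a status string."""
    skill, target = subgoal.split(":")
    return f"{skill} succeeded on {target}"


event = SoundEvent(label="phone_ringing", position=(1.0, 0.0, 2.5))
log = [execute(goal) for goal in plan(event)]
```

The point of the separation is that the planner reasons about *what* to do from the auditory event alone, with no per-task instruction, while the policies handle *how* to navigate and manipulate.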

Abstract

Current mobile manipulation research predominantly follows an instruction-driven paradigm, where agents rely on predefined textual commands to execute tasks. However, this setting confines agents to a passive role, limiting their autonomy and ability to react to dynamic environmental events. To address these limitations, we introduce sound-triggered mobile manipulation, where agents must actively perceive and interact with sound-emitting objects without explicit action instructions. To support this task, we develop Habitat-Echo, a data platform that integrates acoustic rendering with physical interaction. We further propose a baseline comprising a high-level task planner and low-level policy models to complete these tasks. Extensive experiments show that the proposed baseline empowers agents to actively detect and respond to auditory events, eliminating the need for case-by-case instructions. Notably, in the challenging dual-source scenario, the agent successfully isolates the primary source from overlapping acoustic interference to execute the first interaction, and subsequently proceeds to manipulate the secondary object, confirming the robustness of the baseline.
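The dual-source behavior, interacting with the primary source before the secondary one, implies some ranking of overlapping sources. The abstract does not say how the baseline ranks them; one simple stand-in (an assumption, not the paper's method) is to order sources by RMS energy and treat the loudest as primary.

```python
# Illustrative only: rank overlapping sound sources by RMS energy and
# interact with the loudest ("primary") source first. The paper's actual
# primary-source isolation method is not specified in this summary.
import math


def rms(samples):
    """Root-mean-square energy of a waveform given as a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def interaction_order(sources):
    """sources: dict mapping source name -> samples; loudest-first order."""
    return sorted(sources, key=lambda name: rms(sources[name]), reverse=True)


# Toy waveforms: the alarm clock is the dominant (primary) source.
sources = {
    "alarm_clock": [0.8, -0.7, 0.9, -0.8],
    "faucet_drip": [0.1, -0.1, 0.2, -0.1],
}
order = interaction_order(sources)  # ["alarm_clock", "faucet_drip"]
```

In practice an agent would need source separation or localization before such a ranking is meaningful; this sketch only illustrates the prioritize-then-proceed behavior the experiments report.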