InCoM: Intent-Driven Perception and Structured Coordination for Mobile Manipulation

arXiv cs.RO / 28 Apr 2026


Key Points

  • The paper introduces InCoM, a mobile-manipulation framework that combines intent-driven perception with structured coordination to handle changing viewpoints and the need for coordinated base-arm control.
  • InCoM infers latent motion intent to dynamically reweight multi-scale perceptual features, allowing the robot to allocate visual attention in a stage-adaptive way during manipulation.
  • To improve robustness across modalities, the method adds a geometric-semantic structured alignment mechanism that strengthens correspondence between different sensory inputs.
  • On the control side, it uses a decoupled coordinated flow-matching action decoder that explicitly models coordinated base and arm actions, reducing optimization issues caused by strong coupling.
  • Experiments show InCoM outperforming state-of-the-art approaches, improving success rates by 28.2%, 26.1%, and 23.6% across three ManiSkill-HAB scenarios without privileged information, and also performing better in real-world tasks.
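The intent-driven reweighting described above can be sketched in miniature. This is a hedged illustration, not the paper's implementation: the latent intent vector, the projection matrix `W`, and the feature dimensions are all illustrative assumptions. The core idea shown is that a latent motion-intent vector produces one score per perceptual scale, and a softmax over those scores yields stage-adaptive attention weights used to fuse the multi-scale features.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def intent_reweight(features, intent, W):
    """Fuse multi-scale features with weights derived from a latent intent.

    features: list of S arrays, each shape (D,) -- per-scale feature vectors
    intent:   shape (K,) latent motion-intent vector (hypothetical encoding)
    W:        shape (S, K) projection from intent to one score per scale
    """
    scores = W @ intent               # one raw score per perceptual scale
    weights = softmax(scores)         # stage-adaptive attention over scales
    fused = sum(w * f for w, f in zip(weights, features))
    return fused, weights

# toy example: 3 scales, 4-dim features, 2-dim intent (all sizes illustrative)
rng = np.random.default_rng(0)
feats = [rng.standard_normal(4) for _ in range(3)]
intent = np.array([1.0, -0.5])
W = rng.standard_normal((3, 2))
fused, weights = intent_reweight(feats, intent, W)
```

As the inferred intent changes across manipulation stages (e.g. approach vs. grasp), the scores and hence the per-scale weights shift, which is the mechanism the key point describes.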

Abstract

Mobile manipulation is a fundamental capability for general-purpose robotic agents, requiring both coordinated control of the mobile base and manipulator and robust perception under dynamically changing viewpoints. However, existing approaches face two key challenges: strong coupling between base and arm actions complicates control optimization, and perceptual attention is often poorly allocated as viewpoints shift during mobile manipulation. We propose InCoM, an intent-driven perception and structured coordination framework for mobile manipulation. InCoM infers latent motion intent to dynamically reweight multi-scale perceptual features, enabling stage-adaptive allocation of perceptual attention. To support robust cross-modal perception, InCoM further incorporates a geometric-semantic structured alignment mechanism that enhances multimodal correspondence. On the control side, we design a decoupled coordinated flow-matching action decoder that explicitly models coordinated base-arm action generation, alleviating optimization difficulties caused by control coupling. Experimental results demonstrate that InCoM significantly outperforms state-of-the-art methods, achieving success rate gains of 28.2%, 26.1%, and 23.6% across three ManiSkill-HAB scenarios without privileged information. Furthermore, its effectiveness is consistently validated in real-world mobile manipulation tasks, where InCoM maintains a superior success rate over existing baselines.
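To make the flow-matching decoder concrete, here is a minimal sketch of one conditional flow-matching training step with decoupled base and arm heads. Everything here is an assumption for illustration: the linear heads, the action dimensions (2 for the base, 7 for the arm), and the straight-line probability path are not taken from the paper. The sketch shows the general recipe: interpolate between a noise sample and an expert action, then regress each head's predicted velocity toward the interpolation's velocity target for its own action slice, so the base and arm losses decouple.

```python
import numpy as np

def linear_head(W, b, x):
    """Toy stand-in for a learned velocity-prediction head."""
    return W @ x + b

def flow_matching_targets(a0, a1, t):
    """Straight-line path a_t = (1-t)*a0 + t*a1 and its velocity a1 - a0."""
    a_t = (1 - t) * a0 + t * a1
    v_target = a1 - a0
    return a_t, v_target

rng = np.random.default_rng(1)
d_base, d_arm = 2, 7                        # illustrative action dimensions
a1 = rng.standard_normal(d_base + d_arm)    # expert (demonstration) action
a0 = rng.standard_normal(d_base + d_arm)    # noise sample
t = 0.3                                     # sampled interpolation time
a_t, v = flow_matching_targets(a0, a1, t)

# decoupled heads: each predicts the velocity for only its action slice,
# so the two regression losses can be optimized without cross-coupling
Wb = rng.standard_normal((d_base, d_base + d_arm)); bb = np.zeros(d_base)
Wa = rng.standard_normal((d_arm, d_base + d_arm)); ba = np.zeros(d_arm)
v_base = linear_head(Wb, bb, a_t)
v_arm = linear_head(Wa, ba, a_t)

loss = (np.mean((v_base - v[:d_base]) ** 2)
        + np.mean((v_arm - v[d_base:]) ** 2))
```

At inference time such a decoder would integrate the learned velocity field from noise toward an action; here only the supervised training target is shown, which is where the decoupling into separate base and arm objectives matters.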