InCoM: Intent-Driven Perception and Structured Coordination for Mobile Manipulation
arXiv cs.RO · April 28, 2026
Key Points
- The paper introduces InCoM, a mobile-manipulation framework that combines intent-driven perception with structured coordination to handle changing viewpoints and the need for coordinated base-arm control.
- InCoM infers latent motion intent to dynamically reweight multi-scale perceptual features, allowing the robot to allocate visual attention in a stage-adaptive way during manipulation.
- To improve robustness across modalities, the method adds a geometric-semantic structured alignment mechanism that strengthens correspondence between different sensory inputs.
- On the control side, it uses a decoupled coordinated flow-matching action decoder that explicitly models coordinated base and arm actions, mitigating the optimization interference that arises when tightly coupled base and arm motions are modeled jointly.
- Experiments show InCoM outperforming state-of-the-art approaches, improving success rates by 28.2%, 26.1%, and 23.6% across three ManiSkill-HAB scenarios without privileged information, and also performing better in real-world tasks.
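The decoder in the fourth point can be illustrated with a toy sketch. This is not the paper's code: the dimensions, the linear "heads", and the shared-observation conditioning are all assumptions chosen to show the general idea, namely conditional flow matching (regressing a velocity field along a linear noise-to-action path) with separate prediction heads for the base and the arm.

```python
# Hypothetical sketch of a decoupled flow-matching action decoder.
# Assumed: a 3-DoF planar base (x, y, yaw) and a 7-DoF arm; real models
# would use neural networks, not random linear maps.
import numpy as np

rng = np.random.default_rng(0)

BASE_DIM, ARM_DIM = 3, 7
OBS_DIM = 16  # assumed size of the shared observation feature

def linear_head(in_dim, out_dim):
    """A toy linear map standing in for each velocity-field head."""
    W = rng.normal(scale=0.1, size=(out_dim, in_dim))
    return lambda x: W @ x

# Decoupled heads: base and arm velocities are predicted separately,
# each conditioned on the shared observation, its own noisy action
# slice, and the flow time t.
base_head = linear_head(OBS_DIM + BASE_DIM + 1, BASE_DIM)
arm_head = linear_head(OBS_DIM + ARM_DIM + 1, ARM_DIM)

def flow_matching_loss(obs, action):
    """Conditional flow-matching loss: regress the constant velocity
    (action - a0) along the path a_t = (1 - t) * a0 + t * action."""
    a0 = rng.normal(size=action.shape)   # noise sample
    t = rng.uniform()                    # random time in [0, 1]
    a_t = (1 - t) * a0 + t * action      # point on the interpolation path
    target_v = action - a0               # ground-truth velocity
    base_in = np.concatenate([obs, a_t[:BASE_DIM], [t]])
    arm_in = np.concatenate([obs, a_t[BASE_DIM:], [t]])
    pred_v = np.concatenate([base_head(base_in), arm_head(arm_in)])
    return float(np.mean((pred_v - target_v) ** 2))

obs = rng.normal(size=OBS_DIM)
action = rng.normal(size=BASE_DIM + ARM_DIM)
loss = flow_matching_loss(obs, action)
print(f"flow-matching loss: {loss:.4f}")
```

The design choice being sketched: because each head only sees its own action slice, gradients from the base objective do not flow through the arm head and vice versa, which is one plausible reading of how decoupling eases the optimization issues the paper attributes to strong base-arm coupling.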