GenerativeMPC: VLM-RAG-guided Whole-Body MPC with Virtual Impedance for Bimanual Mobile Manipulation

arXiv cs.RO / April 22, 2026


Key Points

  • The paper introduces GenerativeMPC, a hierarchical cyber-physical framework that connects semantic scene understanding to physical control constraints for bimanual mobile manipulation.
  • It uses a Vision-Language Model with Retrieval-Augmented Generation (VLM-RAG) to convert visual and linguistic context into grounded MPC outputs such as dynamic velocity limits and safety margins (see the constraint-grounding sketch after this list).
  • The same VLM-RAG component adjusts virtual stiffness and damping gains for a unified impedance-admittance controller, allowing context-aware compliant behavior during human-robot interaction.
  • An experience-driven vector database maintains consistent parameter grounding without retraining, improving practicality and stability (see the retrieval sketch after this list).
  • Experiments in MuJoCo, IsaacSim, and on a physical bimanual platform show about a 60% speed reduction near humans and demonstrate safe, socially aware navigation and manipulation.
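
To make the semantic-to-physical mapping concrete, here is a minimal Python sketch of constraint grounding. It is not the paper's code: `ControlConstraints`, `ground_constraints`, and every numeric bound are hypothetical, and the clamping step is just one plausible way to keep a mis-grounded VLM suggestion from reaching the high-frequency control loop.

```python
from dataclasses import dataclass

@dataclass
class ControlConstraints:
    v_max: float          # base velocity limit [m/s]
    safety_margin: float  # minimum clearance to humans/obstacles [m]

# Hard bounds the MPC always respects, regardless of what the VLM proposes.
V_MAX_ABS = 1.0            # absolute velocity ceiling [m/s] (assumed)
MARGIN_RANGE = (0.2, 1.5)  # allowable safety-margin range [m] (assumed)

def ground_constraints(vlm_v_max: float, vlm_margin: float) -> ControlConstraints:
    """Clamp VLM-proposed parameters into verified physical bounds."""
    v = max(0.0, min(vlm_v_max, V_MAX_ABS))
    m = max(MARGIN_RANGE[0], min(vlm_margin, MARGIN_RANGE[1]))
    return ControlConstraints(v_max=v, safety_margin=m)

# Example: a "person at close range" context yielding a ~60% speed reduction
# relative to the nominal 1.0 m/s ceiling.
print(ground_constraints(vlm_v_max=0.4, vlm_margin=0.8))
```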

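The experience-driven lookup can be sketched the same way, as nearest-neighbor retrieval over stored context embeddings. Again a hypothetical illustration rather than the paper's implementation: the names, the three-dimensional embeddings, and the parameter values are all toy assumptions.

```python
import numpy as np

# Toy "experience" store: each row pairs a context embedding with control
# parameters that worked in that context. All values are illustrative.
experience_keys = np.array([
    [0.9, 0.1, 0.0],   # "crowded corridor"
    [0.1, 0.9, 0.0],   # "open warehouse aisle"
    [0.0, 0.1, 0.9],   # "handover to a person"
])
experience_params = [
    {"v_max": 0.4, "stiffness": 150.0},
    {"v_max": 1.0, "stiffness": 400.0},
    {"v_max": 0.3, "stiffness": 80.0},
]

def retrieve_params(query_embedding: np.ndarray) -> dict:
    """Return the parameters of the most similar stored experience (cosine)."""
    q = query_embedding / np.linalg.norm(query_embedding)
    keys = experience_keys / np.linalg.norm(experience_keys, axis=1, keepdims=True)
    return experience_params[int(np.argmax(keys @ q))]

print(retrieve_params(np.array([0.8, 0.2, 0.1])))  # -> crowded-corridor params
```

Because parameters are retrieved rather than generated from scratch, the same scene maps to the same limits run after run, which is the consistency-without-retraining property the key points describe.
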
Abstract

Bimanual mobile manipulation requires seamless integration of high-level semantic reasoning with safe, compliant physical interaction: a challenge that end-to-end models handle opaquely and that classical controllers lack the context to address. This paper presents GenerativeMPC, a hierarchical cyber-physical framework that explicitly bridges semantic scene understanding with physical control parameters for bimanual mobile manipulators. The system uses a Vision-Language Model with Retrieval-Augmented Generation (VLM-RAG) to translate visual and linguistic context into grounded control constraints, specifically outputting dynamic velocity limits and safety margins for a Whole-Body Model Predictive Controller (MPC). Simultaneously, the VLM-RAG module modulates virtual stiffness and damping gains for a unified impedance-admittance controller, enabling context-aware compliance during human-robot interaction. Our framework leverages an experience-driven vector database to ensure consistent parameter grounding without retraining. Experimental results in MuJoCo, IsaacSim, and on a physical bimanual platform confirm a 60% speed reduction near humans and safe, socially aware navigation and manipulation through semantic-to-physical parameter grounding. This work advances human-centric cybernetics by grounding large-scale cognitive models in predictable, high-frequency physical control loops.
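
As a rough picture of the compliance side, the sketch below simulates a standard one-dimensional admittance law, M·ẍ + D·ẋ + K·(x − x_ref) = f_ext, where the virtual stiffness K and damping D stand in for the gains the VLM-RAG module modulates. This is a generic textbook controller, not the paper's unified impedance-admittance formulation; `admittance_step`, the gain pairs, and the 500 Hz rate are assumptions.

```python
def admittance_step(x, x_dot, x_ref, f_ext, K, D, M=2.0, dt=0.002):
    """One semi-implicit Euler step of M*x_ddot + D*x_dot + K*(x - x_ref) = f_ext.

    Low K and D (e.g., when the VLM flags human contact) let the arm yield
    to external force; high gains track the reference stiffly.
    """
    x_ddot = (f_ext - D * x_dot - K * (x - x_ref)) / M
    x_dot = x_dot + x_ddot * dt
    x = x + x_dot * dt
    return x, x_dot

# Compare a stiff and a compliant gain set under the same constant 5 N push.
for K, D, label in [(400.0, 40.0, "stiff"), (80.0, 18.0, "compliant")]:
    x, x_dot = 0.0, 0.0
    for _ in range(500):  # 1 s of simulation at 500 Hz
        x, x_dot = admittance_step(x, x_dot, x_ref=0.0, f_ext=5.0, K=K, D=D)
    print(f"{label}: deflection ~= {x:.4f} m (steady state f/K = {5.0/K:.4f})")
```

The softer gain set yields a roughly five-times-larger steady deflection (f_ext/K) under the same push, the kind of yielding behavior a context-aware controller would select near humans.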