MOMO: A framework for seamless physical, verbal, and graphical robot skill learning and adaptation

arXiv cs.CL / 4/23/2026

Key Points

  • The MOMO framework targets industrial robots that non-expert users must flexibly adapt to varying tasks and environments, and it offers three interaction modalities for doing so: kinesthetic touch, natural language, and a graphical web UI.
  • It combines energy-based human-intention detection with a “tool-based LLM” approach in which the LLM selects and parameterizes predefined functions rather than generating code, making natural-language skill adaptation safer (see the dispatch sketch after this list).
  • For motion encoding and learning, MOMO uses Kernelized Movement Primitives (KMPs) to represent robot skills and probabilistic Virtual Fixtures to guide demonstration recording (a toy KMP sketch also follows this list).
  • For finishing tasks, the method integrates probabilistic guidance and ergodic control, and it demonstrates voice-commanded surface finishing by generalizing the same adaptation mechanism from KMPs to ergodic control (sketched after the abstract).
  • A validation on a 7-DoF torque-controlled robot at the Automatica 2025 trade fair supports the claimed practical applicability in industrial environments.
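
The “tool-based LLM” design is, in effect, constrained function calling: the model may only choose among whitelisted, bounds-checked adaptation functions and never emits executable code. The sketch below illustrates the pattern; the tool names, parameter bounds, and JSON call format are illustrative assumptions, not the paper's actual interface.

```python
# Minimal sketch of a tool-dispatch layer for LLM-driven skill adaptation.
# The LLM returns a JSON tool call; only whitelisted, bounds-checked
# functions can run. Tool names and limits here are hypothetical.
import json

def shift_via_point(index: int, dx: float, dy: float, dz: float) -> str:
    """Hypothetical tool: translate one via-point of the current skill."""
    # A real implementation would edit the KMP reference database here.
    return f"via-point {index} shifted by ({dx}, {dy}, {dz}) m"

def scale_speed(factor: float) -> str:
    """Hypothetical tool: scale the skill's execution speed."""
    return f"speed scaled by {factor}x"

# Whitelist with per-parameter bounds: the safety argument for tools over
# free-form code generation is that anything outside these limits is rejected.
TOOLS = {
    "shift_via_point": (shift_via_point,
                        {"dx": (-0.2, 0.2), "dy": (-0.2, 0.2), "dz": (-0.2, 0.2)}),
    "scale_speed": (scale_speed, {"factor": (0.1, 2.0)}),
}

def dispatch(llm_tool_call: str) -> str:
    """Validate and execute a JSON tool call produced by the LLM."""
    call = json.loads(llm_tool_call)
    name, args = call["name"], call["arguments"]
    if name not in TOOLS:
        return f"rejected: unknown tool '{name}'"
    fn, bounds = TOOLS[name]
    for key, (lo, hi) in bounds.items():
        if key in args and not lo <= args[key] <= hi:
            return f"rejected: {key}={args[key]} outside [{lo}, {hi}]"
    return fn(**args)

# e.g. for the utterance "move the third via-point 5 cm up":
print(dispatch('{"name": "shift_via_point",'
               ' "arguments": {"index": 2, "dx": 0.0, "dy": 0.0, "dz": 0.05}}'))
```

KMPs, in turn, predict the mean of an adapted trajectory by kernel regression over a reference database of time-stamped means, which is why inserting a desired via-point amounts to a simple database edit. A toy sketch, assuming scalar outputs and identity reference covariances (the full formulation also propagates covariances):

```python
# Toy KMP mean prediction: kernel regression over the reference trajectory,
# with a via-point adaptation shown as a plain database edit. Simplified to
# identity reference covariances; hyperparameters are illustrative.
import numpy as np

def rbf(a, b, ell=0.1):
    """Squared-exponential kernel between two sets of time stamps."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

def kmp_predict(t_ref, mu_ref, t_query, ell=0.1, lam=1e-3):
    """Adapted mean at t_query, given reference times t_ref (N,) and
    reference means mu_ref (N, D), e.g. from GMR over demonstrations."""
    K = rbf(t_ref, t_ref, ell) + lam * np.eye(len(t_ref))
    return rbf(t_query, t_ref, ell) @ np.linalg.solve(K, mu_ref)

# A 1-D skill adapted by appending a via-point (t=0.5, y=1.3) to the database.
t_ref = np.linspace(0.0, 1.0, 20)
mu_ref = np.sin(np.pi * t_ref)[:, None]
t_ref = np.append(t_ref, 0.5)
mu_ref = np.vstack([mu_ref, [[1.3]]])
trajectory = kmp_predict(t_ref, mu_ref, np.linspace(0.0, 1.0, 100))
```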

Abstract

Industrial robot applications require increasingly flexible systems that non-expert users can easily adapt for varying tasks and environments. However, different adaptations benefit from different interaction modalities. We present an interactive framework that enables robot skill adaptation through three complementary modalities: kinesthetic touch for precise spatial corrections, natural language for high-level semantic modifications, and a graphical web interface for visualizing geometric relations and trajectories, inspecting and adjusting parameters, and editing via-points by drag-and-drop. The framework integrates five components: energy-based human-intention detection, a tool-based LLM architecture (where the LLM selects and parameterizes predefined functions rather than generating code) for safe natural language adaptation, Kernelized Movement Primitives (KMPs) for motion encoding, probabilistic Virtual Fixtures for guided demonstration recording, and ergodic control for surface finishing. We demonstrate that this tool-based LLM architecture generalizes skill adaptation from KMPs to ergodic control, enabling voice-commanded surface finishing. Validation on a 7-DoF torque-controlled robot at the Automatica 2025 trade fair demonstrates the practical applicability of our approach in industrial settings.
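
Ergodic control, the last component above, steers the end effector so that the time-averaged statistics of its position match a target coverage density over the surface, which is what makes it a natural fit for finishing. Below is a compact sketch of a spectral (SMC-style) ergodic controller on the unit square; the mode count, gains, step size, and uniform target density are all illustrative, not MOMO's actual controller.

```python
# Compact 2-D sketch of spectral (SMC-style) ergodic control: steer a point
# so its time-averaged position statistics match a target coverage density
# on [0,1]^2. Mode count, step size, and the uniform target are illustrative.
import numpy as np

K = 5                                                 # Fourier modes per axis
ks = np.array([(i, j) for i in range(K) for j in range(K)], dtype=float)
Lam = (1.0 + np.sum(ks**2, axis=1)) ** -1.5           # per-mode weights
phi_k = np.where((ks == 0).all(axis=1), 1.0, 0.0)     # uniform density coeffs

def basis_and_grad(x):
    """Cosine basis values and spatial gradients at x."""
    cx, cy = np.cos(np.pi * ks[:, 0] * x[0]), np.cos(np.pi * ks[:, 1] * x[1])
    sx, sy = np.sin(np.pi * ks[:, 0] * x[0]), np.sin(np.pi * ks[:, 1] * x[1])
    grads = np.stack([-np.pi * ks[:, 0] * sx * cy,
                      -np.pi * ks[:, 1] * cx * sy], axis=1)
    return cx * cy, grads

x = np.array([0.2, 0.3])                              # end-effector position
C, dt, u_max = np.zeros(len(ks)), 0.01, 1.0
for step in range(1, 5001):
    f, g = basis_and_grad(x)
    C += f * dt                                       # running coverage stats
    S = C - step * dt * phi_k                         # mismatch to the target
    B = (Lam * S) @ g                                 # descent direction
    u = -u_max * B / (np.linalg.norm(B) + 1e-9)       # saturated control
    x = np.clip(x + u * dt, 0.0, 1.0)
```

Exposing a controller's target density or coverage weights as tool parameters is, plausibly, what generalizing the LLM-based adaptation from KMPs to ergodic control amounts to in practice.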