MARS: Multi-Agent Robotic System with Multimodal Large Language Models for Assistive Intelligence

arXiv cs.RO / 4/8/2026

Key Points

  • The paper introduces MARS, a multi-agent smart-home robotic system for assistive intelligence powered by multimodal large language models (MLLMs), targeting challenges like risk-aware planning and user personalization.
  • MARS uses four specialized agents (visual perception, risk assessment, planning, and evaluation) to translate understanding of cluttered indoor environments into coordinated, executable actions; a minimal sketch of this pipeline follows the list.
  • The framework emphasizes grounding language plans into action sequences via hierarchical multi-agent decision-making, enabling adaptive assistance in dynamic home settings.
  • Experiments on multiple datasets report improved performance over state-of-the-art multimodal models, particularly for risk-aware planning and multi-agent execution coordination.
  • The authors position the approach as a generalizable methodology for deploying collaborative, MLLM-enabled multi-agent systems in real-world assistive scenarios.
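
Since the paper does not publish code or interfaces, the following is a minimal, runnable Python sketch of how such a four-agent pipeline could be wired together. Every class name, method, and heuristic below (PerceptionAgent, the toy severity table, and so on) is an illustrative assumption; in the real system each agent would wrap an MLLM call rather than the toy logic used here.

```python
from dataclasses import dataclass

# Illustrative sketch of MARS's four-agent pipeline. All interfaces are
# assumptions: the paper describes agent roles, not an API.

@dataclass
class Hazard:
    label: str
    severity: int  # higher means more urgent

class PerceptionAgent:
    """Stand-in for MLLM-based semantic/spatial feature extraction."""
    def parse(self, observation: str) -> list[str]:
        return [token.strip() for token in observation.split(",")]

class RiskAssessmentAgent:
    """Identifies hazards and prioritizes them by severity."""
    SEVERITY = {"knife": 3, "spill": 2, "cable": 1}  # toy severity table
    def assess(self, objects: list[str]) -> list[Hazard]:
        hazards = [Hazard(o, self.SEVERITY[o]) for o in objects if o in self.SEVERITY]
        return sorted(hazards, key=lambda h: h.severity, reverse=True)

class PlanningAgent:
    """Grounds the goal into an executable, risk-aware action sequence."""
    def plan(self, goal: str, hazards: list[Hazard]) -> list[str]:
        # Risk-aware ordering: mitigate hazards before the main task.
        return [f"secure {h.label}" for h in hazards] + [goal]

class EvaluationAgent:
    """Accepts a plan only if every identified hazard is addressed."""
    def accept(self, plan: list[str], hazards: list[Hazard]) -> bool:
        return all(any(h.label in step for step in plan) for h in hazards)

if __name__ == "__main__":
    objects = PerceptionAgent().parse("cup, knife, cable, book")
    hazards = RiskAssessmentAgent().assess(objects)
    plan = PlanningAgent().plan("fetch medication", hazards)
    assert EvaluationAgent().accept(plan, hazards)
    print(plan)  # ['secure knife', 'secure cable', 'fetch medication']
```

Sorting hazards by severity before planning is what makes even this toy plan risk-aware: mitigation steps are emitted ahead of the main task, and the evaluation stage rejects any plan that leaves an identified hazard unaddressed.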

Abstract

Multimodal large language models (MLLMs) have shown remarkable capabilities in cross-modal understanding and reasoning, offering new opportunities for intelligent assistive systems. Yet existing systems still struggle with risk-aware planning, user personalization, and grounding language plans into executable skills in cluttered homes. We introduce MARS, a Multi-Agent Robotic System powered by MLLMs for assistive intelligence and designed for smart-home robots that support people with disabilities. The system integrates four agents: a visual perception agent that extracts semantic and spatial features from environment images, a risk assessment agent that identifies and prioritizes hazards, a planning agent that generates executable action sequences, and an evaluation agent that iteratively optimizes those sequences. By combining multimodal perception with hierarchical multi-agent decision-making, the framework enables adaptive, risk-aware, and personalized assistance in dynamic indoor environments. Experiments on multiple datasets show that the proposed system outperforms state-of-the-art multimodal models in risk-aware planning and coordinated multi-agent execution. The approach highlights the potential of collaborative AI for practical assistive scenarios and offers a generalizable methodology for deploying MLLM-enabled multi-agent systems in real-world environments.
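
The abstract describes the evaluation agent as performing iterative optimization, which suggests a plan-critique loop between the planning and evaluation agents. Below is a minimal sketch of such a loop, assuming the planner can consume textual feedback; refine_plan, propose, and critique are hypothetical names, not the paper's API.

```python
from typing import Callable

# Hypothetical plan-critique loop: the evaluator either accepts the plan or
# returns feedback that the planner uses on the next round.

def refine_plan(
    propose: Callable[[str], list[str]],                 # planner: feedback -> plan
    critique: Callable[[list[str]], tuple[bool, str]],   # evaluator: plan -> (ok, feedback)
    max_rounds: int = 3,
) -> list[str]:
    feedback = ""
    plan: list[str] = []
    for _ in range(max_rounds):
        plan = propose(feedback)
        ok, feedback = critique(plan)
        if ok:
            break
    return plan

if __name__ == "__main__":
    # Toy agents: the evaluator insists that hazard mitigation come first.
    def propose(feedback: str) -> list[str]:
        steps = ["bring medication", "wipe spill"]
        if "hazard first" in feedback:
            steps.reverse()
        return steps

    def critique(plan: list[str]) -> tuple[bool, str]:
        if plan[0].startswith("wipe"):
            return True, ""
        return False, "reorder: hazard first"

    print(refine_plan(propose, critique))  # ['wipe spill', 'bring medication']
```

Bounding the loop with max_rounds keeps refinement cheap and guarantees termination even when the evaluator never accepts, a common safeguard in agentic pipelines of this kind.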