
UAV-MARL: Multi-Agent Reinforcement Learning for Time-Critical and Dynamic Medical Supply Delivery

arXiv cs.LG / 3/12/2026

📰 News · Models & Research

Key Points

  • The paper proposes a multi-agent reinforcement learning framework for coordinating UAV fleets in stochastic medical delivery scenarios, formulated as a partially observable Markov decision process (POMDP).
  • It uses Proximal Policy Optimization (PPO) as the primary learning algorithm and evaluates variants including asynchronous extensions and classical actor–critic methods to analyze scalability and trade-offs.
  • The framework provides a real-time decision-support layer that prioritizes medical tasks and reallocates UAV resources to assist healthcare personnel in urgent logistics.
  • Evaluations leverage real-world geographic data from OpenStreetMap, showing PPO achieves superior coordination performance compared to other learning strategies.
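The POMDP setting in the first point — each UAV sees the open delivery requests but not the rest of the fleet — can be sketched as a toy multi-agent environment. This is a minimal illustration, not the paper's implementation; all names (`UAVDeliveryEnv`, the grid world, the urgency-weighted rewards) are assumptions made for the sketch.

```python
import random

class UAVDeliveryEnv:
    """Toy multi-agent medical-delivery environment (illustrative only).

    Each UAV agent gets a *partial* observation: its own grid position plus
    the list of open requests, but not the other agents' positions,
    mimicking the communication/localization constraints in the paper.
    """

    def __init__(self, n_uavs=3, grid=10, n_requests=5, seed=0):
        self.rng = random.Random(seed)
        self.n_uavs, self.grid = n_uavs, grid
        self.positions = [(0, 0)] * n_uavs
        # Each request: (x, y, urgency); higher urgency => higher reward.
        self.requests = [
            (self.rng.randrange(grid), self.rng.randrange(grid),
             self.rng.randint(1, 3))
            for _ in range(n_requests)
        ]

    def observe(self, agent):
        # Partial observability: own position + open requests only.
        return {"pos": self.positions[agent], "requests": list(self.requests)}

    def step(self, actions):
        """actions: one (dx, dy) move per UAV; returns per-agent rewards."""
        rewards = [0.0] * self.n_uavs
        for i, (dx, dy) in enumerate(actions):
            x, y = self.positions[i]
            self.positions[i] = (
                max(0, min(self.grid - 1, x + dx)),
                max(0, min(self.grid - 1, y + dy)),
            )
            # Serve any request at the new cell; reward scales with urgency.
            for req in list(self.requests):
                if (req[0], req[1]) == self.positions[i]:
                    rewards[i] += float(req[2])
                    self.requests.remove(req)
        return rewards
```

Because `observe` exposes no fleet state, any policy trained against it must coordinate implicitly — the core difficulty the MARL framework addresses.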

Abstract

Unmanned aerial vehicles (UAVs) are increasingly used to support time-critical medical supply delivery, providing rapid and flexible logistics during emergencies and resource shortages. However, effective deployment of UAV fleets requires coordination mechanisms capable of prioritizing medical requests, allocating limited aerial resources, and adapting delivery schedules under uncertain operational conditions. This paper presents a multi-agent reinforcement learning (MARL) framework for coordinating UAV fleets in stochastic medical delivery scenarios where requests vary in urgency, location, and delivery deadlines. The problem is formulated as a partially observable Markov decision process (POMDP) in which UAV agents maintain awareness of medical delivery demands while having limited visibility of other agents due to communication and localization constraints. The proposed framework employs Proximal Policy Optimization (PPO) as the primary learning algorithm and evaluates several variants, including asynchronous extensions, classical actor–critic methods, and architectural modifications to analyze scalability and performance trade-offs. The model is evaluated using real-world geographic data from selected clinics and hospitals extracted from the OpenStreetMap dataset. The framework provides a decision-support layer that prioritizes medical tasks, reallocates UAV resources in real time, and assists healthcare personnel in managing urgent logistics. Experimental results show that classical PPO achieves superior coordination performance compared to asynchronous and sequential learning strategies, highlighting the potential of reinforcement learning for adaptive and scalable UAV-assisted healthcare logistics.
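The PPO algorithm the abstract names as the primary learner is defined by its clipped surrogate objective. The sketch below writes out the standard PPO-Clip formula; it is not code from the paper, and any paper-specific architectural modifications are outside its scope.

```python
import numpy as np

def ppo_clip_objective(new_logp, old_logp, advantages, eps=0.2):
    """Standard PPO clipped surrogate objective (to be *maximized*).

    ratio = pi_new(a|s) / pi_old(a|s); clipping the ratio to
    [1 - eps, 1 + eps] keeps each update close to the behavior policy,
    which is what makes PPO stable enough for multi-agent training.
    """
    ratio = np.exp(new_logp - old_logp)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Element-wise minimum gives a pessimistic bound on improvement.
    return np.mean(np.minimum(unclipped, clipped))
```

When the new and old policies coincide the ratio is 1 and the objective reduces to the mean advantage; a large ratio paired with a positive advantage is capped at `(1 + eps) * advantage`, which is the mechanism that bounds per-step policy drift.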