Fine-Tuning Large Language Models for Cooperative Tactical Deconfliction of Small Unmanned Aerial Systems

arXiv cs.RO · March 31, 2026


Key Points

  • The paper addresses tactical deconfliction for low-altitude small UAVs as a safety-critical multi-agent problem requiring cooperative separation and operational efficiency under partial observability.
  • It proposes using fine-tuned LLMs as decision-makers, overcoming limitations of direct LLM use by generating domain-grounded, rule-consistent data via a simulation-to-language pipeline based on the BlueSky simulator.
  • A pretrained Qwen-Math-7B is fine-tuned with two parameter-efficient methods: supervised fine-tuning with LoRA and preference-based fine-tuning using LoRA plus GRPO.
  • Results from validation datasets and closed-loop simulations show supervised LoRA significantly improves decision accuracy, consistency, and separation performance, including meaningful reductions in near mid-air collision risk.
  • Preference-based tuning with GRPO can enhance coordination but shows decreased robustness when interacting with heterogeneous agent policies, indicating trade-offs for real-world deployment.
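The parameter efficiency of the LoRA approach above comes from updating only a low-rank correction to each frozen pretrained weight. The sketch below is not from the paper; it is a minimal NumPy illustration of the standard LoRA formulation (W + (α/r)·BA with B zero-initialized), with toy dimensions chosen for readability rather than the actual Qwen-Math-7B layer sizes.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, r = 8, 6, 2       # r << min(d_in, d_out): the low-rank bottleneck
alpha = 4                      # LoRA scaling factor

W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight (not trained)
A = rng.normal(size=(r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                # trainable up-projection, zero-initialized
                                        # so the adapter starts as an exact no-op

def adapted_forward(x):
    # Effective weight is W + (alpha / r) * B @ A; gradients flow only to A and B.
    return (W + (alpha / r) * (B @ A)) @ x

x = rng.normal(size=d_in)
# With B = 0, the adapted layer reproduces the frozen base layer exactly:
assert np.allclose(adapted_forward(x), W @ x)

# Full fine-tuning would update d_out * d_in parameters per layer;
# LoRA updates only r * (d_in + d_out).
full_params = d_out * d_in
lora_params = r * (d_in + d_out)
```

At realistic transformer dimensions (d_in = d_out in the thousands, r in the tens), the ratio lora_params / full_params drops well below 1%, which is what makes fine-tuning a 7B model tractable.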

Abstract

The growing deployment of small Unmanned Aerial Systems (sUASs) in low-altitude airspaces has increased the need for reliable tactical deconfliction under safety-critical constraints. Tactical deconfliction involves short-horizon decision-making in dense, partially observable, and heterogeneous multi-agent environments, where both cooperative separation assurance and operational efficiency must be maintained. While Large Language Models (LLMs) exhibit strong reasoning capabilities, their direct application to air traffic control remains limited by insufficient domain grounding and inconsistent, unpredictable outputs. This paper investigates LLMs as decision-makers in cooperative multi-agent tactical deconfliction using fine-tuning strategies that align model outputs with human operator heuristics. We propose a simulation-to-language data generation pipeline based on the BlueSky air traffic simulator that produces rule-consistent deconfliction datasets reflecting established safety practices. A pretrained Qwen-Math-7B model is fine-tuned using two parameter-efficient strategies: supervised fine-tuning with Low-Rank Adaptation (LoRA) and preference-based fine-tuning combining LoRA with Group-Relative Policy Optimization (GRPO). Experimental results on validation datasets and closed-loop simulations demonstrate that supervised LoRA fine-tuning substantially improves decision accuracy, consistency, and separation performance compared to the pretrained LLM, with significant reductions in near mid-air collisions. GRPO provides additional coordination benefits but exhibits reduced robustness when interacting with heterogeneous agent policies.
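The preference-based variant described above relies on GRPO's group-relative credit assignment: several candidate responses are sampled per prompt, and each is scored against its own group rather than by a separate value network. The sketch below is not the paper's implementation; it shows only the standard group-relative advantage normalization from the GRPO formulation, with a made-up group of four candidate deconfliction maneuvers and illustrative reward values.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each candidate's reward against its sampling group.

    GRPO replaces a learned value baseline with the group statistics:
    advantage_i = (r_i - mean(r)) / (std(r) + eps), so candidates that
    beat their siblings get positive advantage and vice versa.
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Hypothetical example: four sampled maneuvers for one conflict prompt,
# scored by a rule-based reward (e.g. separation maintained, path efficiency).
rewards = [1.0, 0.5, 0.0, 0.5]
adv = group_relative_advantages(rewards)
```

These advantages then weight the policy-gradient update for each sampled token sequence; because the baseline is computed within the group, the advantages of any group sum to (approximately) zero.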