Vision-Language Navigation for Aerial Robots: Towards the Era of Large Language Models

arXiv cs.RO / 4/10/2026


Key Points

  • The paper surveys aerial vision-and-language navigation (Aerial VLN), focusing on how UAVs can ground natural-language instructions in visual perception to navigate complex 3D environments.
  • It formalizes the Aerial VLN problem and defines two interaction paradigms, single-instruction and dialog-based navigation, as foundational axes for the field.
  • It classifies existing approaches into five architectural categories (sequence-to-sequence/attention, end-to-end LLM/VLM, hierarchical, multi-agent, and dialog-based) and compares their design rationales, trade-offs, and performance.
  • The survey evaluates the ecosystem for Aerial VLN research, analyzing limitations in datasets, simulation platforms, and metrics—particularly regarding scale, environmental diversity, real-world grounding, and metric coverage.
  • It synthesizes seven open problems: long-horizon instruction grounding, viewpoint robustness, scalable spatial representation, continuous 6-DoF action execution, onboard deployment, benchmark standardization, and multi-UAV swarm navigation.

Abstract

Aerial vision-and-language navigation (Aerial VLN) aims to enable unmanned aerial vehicles (UAVs) to interpret natural language instructions and autonomously navigate complex three-dimensional environments by grounding language in visual perception. This survey provides a critical and analytical review of the Aerial VLN field, with particular attention to the recent integration of large language models (LLMs) and vision-language models (VLMs). We first formally introduce the Aerial VLN problem and define two interaction paradigms, single-instruction and dialog-based, as foundational axes. We then organize the body of Aerial VLN methods into a taxonomy of five architectural categories: sequence-to-sequence and attention-based methods, end-to-end LLM/VLM methods, hierarchical methods, multi-agent methods, and dialog-based navigation methods. For each category, we systematically analyze design rationales, technical trade-offs, and reported performance. We critically assess the evaluation infrastructure for Aerial VLN, including datasets, simulation platforms, and metrics, and identify their gaps in scale, environmental diversity, real-world grounding, and metric coverage. We consolidate cross-method comparisons on shared benchmarks and analyze key architectural trade-offs, including discrete versus continuous actions, end-to-end versus hierarchical designs, and the simulation-to-reality gap. Finally, we synthesize seven concrete open problems: long-horizon instruction grounding, viewpoint robustness, scalable spatial representation, continuous 6-DoF action execution, onboard deployment, benchmark standardization, and multi-UAV swarm navigation, with specific research directions grounded in the evidence presented throughout the survey.
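
The abstract mentions evaluation metrics without naming them in this summary. As an illustration only, the sketch below computes two metrics that are widely used in vision-and-language navigation in general: success rate (SR) and success weighted by path length (SPL). Whether the survey relies on exactly these metrics is an assumption. The `Episode` record, its field names, and the 20 m success threshold are hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Episode:
    """One navigation episode (hypothetical record layout for this sketch)."""
    final_distance_to_goal: float  # metres between the stop position and the goal
    path_length: float             # metres actually flown by the agent
    shortest_path_length: float    # metres along the reference shortest path


def success_rate(episodes: Sequence[Episode], threshold: float) -> float:
    """Fraction of episodes that stop within `threshold` metres of the goal."""
    if not episodes:
        raise ValueError("need at least one episode")
    successes = sum(e.final_distance_to_goal <= threshold for e in episodes)
    return successes / len(episodes)


def spl(episodes: Sequence[Episode], threshold: float) -> float:
    """Success weighted by path length.

    Each successful episode contributes shortest / max(flown, shortest);
    failed episodes contribute 0. The result is averaged over all episodes.
    """
    if not episodes:
        raise ValueError("need at least one episode")
    total = 0.0
    for e in episodes:
        if e.final_distance_to_goal <= threshold:
            denom = max(e.path_length, e.shortest_path_length)
            # A zero-length task that succeeds counts as perfectly efficient.
            total += e.shortest_path_length / denom if denom > 0 else 1.0
    return total / len(episodes)


# Toy data, invented purely for illustration (not results from any benchmark).
episodes = [
    Episode(final_distance_to_goal=5.0, path_length=100.0, shortest_path_length=100.0),
    Episode(final_distance_to_goal=5.0, path_length=200.0, shortest_path_length=100.0),
    Episode(final_distance_to_goal=50.0, path_length=100.0, shortest_path_length=100.0),
]
THRESHOLD_M = 20.0  # hypothetical success radius

sr_value = success_rate(episodes, THRESHOLD_M)
spl_value = spl(episodes, THRESHOLD_M)
```

In this toy data two of the three episodes succeed, but one of them flies twice the shortest path, so SPL penalizes it while SR does not. A gap like this between the two numbers is one reason metric coverage matters when comparing methods.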