Vision-Language Navigation for Aerial Robots: Towards the Era of Large Language Models
arXiv cs.RO / 4/10/2026
Key Points
- The paper surveys aerial vision-and-language navigation (Aerial VLN), focusing on how UAVs can ground natural-language instructions in visual perception to navigate complex 3D environments.
- It formalizes the Aerial VLN problem and distinguishes two interaction paradigms, single-instruction and dialog-based navigation, as a key axis for organizing the field.
- It classifies existing approaches into five architectural categories (sequence-to-sequence/attention, end-to-end LLM/VLM, hierarchical, multi-agent, and dialog-based) and compares their design rationales, trade-offs, and performance.
- The survey evaluates the ecosystem for Aerial VLN research, analyzing limitations in datasets, simulation platforms, and metrics—particularly regarding scale, environmental diversity, real-world grounding, and metric coverage.
- It synthesizes seven major open problems: long-horizon instruction grounding, viewpoint robustness, scalable spatial representation, continuous 6-DoF control, onboard deployment, benchmark standardization, and multi-UAV swarm navigation.
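To make the task setup concrete, the core Aerial VLN loop pairs a natural-language instruction with egocentric observations and asks the agent to emit navigation actions until it stops. The sketch below is a toy illustration of that interaction loop only; all names (`AerialVLNEnv`, `Observation`, the keyword-matching policy) are hypothetical stand-ins, not the survey's formalization or any benchmark's API.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Observation:
    rgb: Tuple[int, int, int]         # stand-in for an egocentric camera frame
    pose: Tuple[float, float, float]  # UAV position (x, y, z)

ACTIONS = ["forward", "turn_left", "turn_right", "ascend", "descend", "stop"]

class AerialVLNEnv:
    """Toy environment: success here is simply reaching altitude z >= 2."""

    def reset(self, instruction: str) -> Observation:
        self.instruction = instruction
        self.pose = [0.0, 0.0, 0.0]
        return Observation(rgb=(0, 0, 0), pose=tuple(self.pose))

    def step(self, action: str) -> Tuple[Observation, bool]:
        if action == "ascend":
            self.pose[2] += 1.0
        done = action == "stop"
        return Observation(rgb=(0, 0, 0), pose=tuple(self.pose)), done

def navigate(env: AerialVLNEnv, instruction: str, max_steps: int = 10) -> List[str]:
    """Keyword-matching policy standing in for a learned instruction-grounding model."""
    obs = env.reset(instruction)
    trajectory: List[str] = []
    for _ in range(max_steps):
        # A real agent would ground the full instruction against perception;
        # here a single keyword drives the (trivial) decision.
        action = "ascend" if "climb" in instruction and obs.pose[2] < 2 else "stop"
        trajectory.append(action)
        obs, done = env.step(action)
        if done:
            break
    return trajectory
```

In a real system the keyword rule would be replaced by one of the surveyed architectures (sequence-to-sequence, end-to-end LLM/VLM, hierarchical, multi-agent, or dialog-based), and the environment by a simulator with photorealistic rendering and 6-DoF dynamics.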



