PilotBench: A Benchmark for General Aviation Agents with Safety Constraints

arXiv cs.AI / 4/13/2026


Key Points

  • PilotBench is introduced as a new benchmark to test whether LLM-based agents can predict safety-critical flight trajectories and aircraft attitude while respecting explicit safety constraints.
  • The benchmark is built from 708 real-world general aviation trajectories across nine distinct flight phases, using synchronized 34-channel telemetry to evaluate both semantic reasoning and physics-governed prediction.
  • A new composite metric, Pilot-Score, weights regression accuracy at 60% and instruction adherence plus safety compliance at 40%, balancing numeric precision against controllability.
  • Across 41 evaluated models, traditional forecasters show better numeric precision (lower MAE), while LLMs demonstrate stronger instruction-following and controllability at the cost of precision, revealing a "Precision-Controllability Dichotomy."
  • Phase-stratified results show LLM performance degrades sharply in high-workload phases (e.g., Climb and Approach), motivating hybrid systems that pair LLM symbolic reasoning with specialized numerical forecasters.
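The Key Points describe Pilot-Score only by its 60/40 weighting. A minimal sketch of such a composite is below; it assumes each component is pre-normalized to [0, 1] and that the 40% is split evenly between instruction adherence and safety compliance. Both assumptions and the function name are illustrative, not the paper's published formula.

```python
def pilot_score(regression_accuracy: float,
                instruction_adherence: float,
                safety_compliance: float) -> float:
    """Hypothetical Pilot-Score composite.

    Assumes each component is already normalized to [0, 1], with 60%
    weight on regression accuracy and the remaining 40% split evenly
    between instruction adherence and safety compliance; the even
    split and the normalization are assumptions made here.
    """
    return (0.6 * regression_accuracy
            + 0.2 * instruction_adherence
            + 0.2 * safety_compliance)


# Illustrative profiles (made-up numbers, not results from the paper):
# a forecaster-like model that is precise but weak on instructions,
forecaster_like = pilot_score(0.90, 0.40, 0.95)
# versus an LLM-like model that is controllable but less precise.
llm_like = pilot_score(0.65, 0.88, 0.90)
```

A composite like this rewards neither extreme alone: a model must trade some raw precision for adherence (or vice versa) to maximize the score, which is what motivates the hybrid architectures the paper suggests.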

Abstract

As Large Language Models (LLMs) advance toward embodied AI agents operating in physical environments, a fundamental question emerges: can models trained on text corpora reliably reason about complex physics while adhering to safety constraints? We address this through PilotBench, a benchmark evaluating LLMs on safety-critical flight trajectory and attitude prediction. Built from 708 real-world general aviation trajectories spanning nine operationally distinct flight phases with synchronized 34-channel telemetry, PilotBench systematically probes the intersection of semantic understanding and physics-governed prediction through comparative analysis of LLMs and traditional forecasters. We introduce Pilot-Score, a composite metric balancing 60% regression accuracy with 40% instruction adherence and safety compliance. Comparative evaluation across 41 models uncovers a Precision-Controllability Dichotomy: traditional forecasters achieve superior MAE of 7.01 but lack semantic reasoning capabilities, while LLMs gain controllability with 86-89% instruction-following at the cost of 11-14 MAE precision. Phase-stratified analysis further exposes a Dynamic Complexity Gap: LLM performance degrades sharply in high-workload phases such as Climb and Approach, suggesting brittle implicit physics models. These empirical discoveries motivate hybrid architectures combining LLMs' symbolic reasoning with specialized forecasters' numerical precision. PilotBench provides a rigorous foundation for advancing embodied AI in safety-constrained domains.