$\pi$, But Make It Fly: Physics-Guided Transfer of VLA Models to Aerial Manipulation

arXiv cs.RO / 3/27/2026

📰 NewsSignals & Early TrendsIdeas & Deep AnalysisModels & Research

共有:

Key Points

The paper proposes AirVLA to transfer vision-language-action (VLA) foundation models from fixed-base robotic arms to aerial pick-and-place, addressing the core “dynamics gap” caused by flight’s underactuated, highly dynamic behavior.
Experiments show that visual representation transfer works, but the flight-specific control dynamics do not transfer directly to aerial manipulation.
To avoid retraining the foundation model, the authors introduce Payload-Aware Guidance that injects payload/constraint information into the policy’s flow-matching sampling process at inference time.
To mitigate limited aerial training data, they synthesize navigation data using a Gaussian Splatting pipeline and report that this synthetic data is a key enabler of performance.
Across 460 real-world experiments, AirVLA improves navigation success to 100% (vs. 81% with teleoperation-only fine-tuning), raises pick-and-place success from 23% to 50% via inference-time guidance, and achieves 62% on a long-horizon compositional task.

Abstract

Vision-Language-Action (VLA) models such as

\pi_0

have demonstrated remarkable generalization across diverse fixed-base manipulators. However, transferring these foundation models to aerial platforms remains an open challenge due to the fundamental mismatch between the quasi-static dynamics of fixed-base arms and the underactuated, highly dynamic nature of flight. In this work, we introduce AirVLA, a system that investigates the transferability of manipulation-pretrained VLAs to aerial pick-and-place tasks. We find that while visual representations transfer effectively, the specific control dynamics required for flight do not. To bridge this "dynamics gap" without retraining the foundation model, we introduce a Payload-Aware Guidance mechanism that injects payload constraints directly into the policy's flow-matching sampling process. To overcome data scarcity, we further utilize a Gaussian Splatting pipeline to synthesize navigation training data. We evaluate our method through a cumulative 460 real-world experiments which demonstrate that this synthetic data is a key enabler of performance, unlocking 100% success in navigation tasks where directly fine-tuning on teleoperation data alone attains 81% success. Our inference-time intervention, Payload-Aware Guidance, increases real-world pick-and-place task success from 23% to 50%. Finally, we evaluate the model on a long-horizon compositional task, achieving a 62% overall success rate. These results suggest that pre-trained manipulation VLAs, with appropriate data augmentation and physics-informed guidance, can transfer to aerial manipulation and navigation, as well as the composition of these tasks.

GDPR and AI Training Data: What You Need to Know Before Training on Personal Data

Dev.to

Edge-to-Cloud Swarm Coordination for heritage language revitalization programs with embodied agent feedback loops

Dev.to

Big Tech firms are accelerating AI investments and integration, while regulators and companies focus on safety and responsible adoption.

Dev.to

Sector HQ Daily AI Intelligence - March 27, 2026

Dev.to

AI Crawler Management: The Definitive Guide to robots.txt for AI Bots

Dev.to

$\pi$, But Make It Fly: Physics-Guided Transfer of VLA Models to Aerial Manipulation

Key Points

Abstract

Related Articles

GDPR and AI Training Data: What You Need to Know Before Training on Personal Data

Edge-to-Cloud Swarm Coordination for heritage language revitalization programs with embodied agent feedback loops

Big Tech firms are accelerating AI investments and integration, while regulators and companies focus on safety and responsible adoption.

Sector HQ Daily AI Intelligence - March 27, 2026

AI Crawler Management: The Definitive Guide to robots.txt for AI Bots

関連おすすめサービス

Notta搭載AI議事録イヤホン ZENCHORD1

AI搭載ボイスレコーダー Plaud

画像高画質化AIツール Aiarty Image Enhancer