LiteVLA-H: Dual-Rate Vision-Language-Action Inference for Onboard Aerial Guidance and Semantic Perception
arXiv cs.CV / 5/5/2026
Key Points
- LiteVLA-H is a compact 256M-parameter vision-language-action (VLA) model proposed for low-latency onboard drone deployment under strict compute and communication constraints.
- The system uses dual-rate operation on an NVIDIA Jetson AGX Orin: a fast outer loop issues short action-token outputs for reactive guidance, while a slower semantic mode handles hazard/scene understanding and operator narration.
- The authors find that, in the edge setting, end-to-end latency is largely dominated by multimodal pre-fill rather than by the additional decoding cost of a few more tokens, motivating their scheduling approach.
- They report reactive action-token issuance at 50.65 ms (19.74 Hz) while still producing sentence-level semantic outputs at about 149.90–164.57 ms (6.08–6.67 Hz) on the same embedded platform.
- A knowledge-preserving fine-tuning recipe mixes flight data, aerial semantic data, and generic caption/VQA supervision to specialize for aerial guidance without losing descriptive competence.
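The dual-rate idea above can be sketched in a few lines: run action inference on every frame and interleave the slower semantic pass only every Nth frame. This is a hypothetical illustration, not the paper's implementation; `fast_action_step`, `slow_semantic_step`, and `semantic_every` are placeholder names, and the divider of 3 is chosen only because it maps a ~20 Hz fast loop onto roughly the ~6-7 Hz semantic rate reported above.

```python
def fast_action_step(frame):
    """Stand-in for short action-token decoding (~20 Hz reactive loop)."""
    return f"action-tokens({frame})"

def slow_semantic_step(frame):
    """Stand-in for sentence-level hazard/scene narration (~6 Hz mode)."""
    return f"narration({frame})"

def run_dual_rate(frames, semantic_every=3):
    """Issue an action on every frame; narrate only on every Nth frame.

    With a ~20 Hz fast loop, semantic_every=3 gives roughly the
    ~6-7 Hz semantic rate mentioned in the summary above.
    """
    actions, narrations = [], []
    for i, frame in enumerate(frames):
        actions.append(fast_action_step(frame))            # fast outer loop
        if i % semantic_every == 0:
            narrations.append(slow_semantic_step(frame))   # slower semantic mode
    return actions, narrations

actions, narrations = run_dual_rate(range(12), semantic_every=3)
print(len(actions), len(narrations))  # prints "12 4"
```

In a real deployment the two passes would share the multimodal pre-fill, which, per the authors' observation, dominates end-to-end latency on the edge device; the decimation factor would then be tuned to the measured per-pass costs rather than fixed at 3.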