PokeVLA: Empowering Pocket-Sized Vision-Language-Action Model with Comprehensive World Knowledge Guidance
arXiv cs.RO / 4/23/2026
Key Points
- PokeVLA is a lightweight vision-language-action (VLA) foundation model designed to improve embodied robotic manipulation by injecting stronger spatial awareness and high-level world knowledge.
- The approach uses a two-stage training process, sketched in the code after this list: first pretraining a compact vision-language model (PokeVLM) on 2.4M multimodal samples, then aligning manipulation-relevant representations into the action space.
- PokeVLA specifically incorporates multi-view goal-aware semantics learning, geometry alignment, and a new "action expert" module to improve action selection.
- Experiments report state-of-the-art results on the LIBERO-Plus benchmark and strong real-world deployment performance, with higher success rates and robustness to various perturbations.
- The authors plan to support reproducibility and community adoption by open-sourcing the code, model weights, and the data-curation scripts for the pre-training dataset.
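
The paper's exact objectives and module interfaces are not given here, so the following is a minimal sketch of the two-stage recipe described above, not the authors' implementation. `CompactVLM`, `ActionExpert`, the placeholder losses, and all dimensions are hypothetical stand-ins for PokeVLM and the action expert; synthetic tensors stand in for the multimodal corpus and robot demonstrations.

```python
import torch
import torch.nn as nn

class CompactVLM(nn.Module):
    """Stand-in for the compact vision-language backbone (hypothetical)."""
    def __init__(self, dim=256):
        super().__init__()
        self.vision = nn.Linear(512, dim)    # placeholder visual encoder
        self.language = nn.Linear(300, dim)  # placeholder text encoder
        self.fuse = nn.Linear(2 * dim, dim)  # fuse the two modalities

    def forward(self, image_feat, text_feat):
        v = self.vision(image_feat)
        t = self.language(text_feat)
        return self.fuse(torch.cat([v, t], dim=-1))

class ActionExpert(nn.Module):
    """Stand-in for the action head aligned to the VLM representation."""
    def __init__(self, dim=256, action_dim=7):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, action_dim)
        )

    def forward(self, rep):
        return self.head(rep)

# Stage 1: multimodal pretraining of the compact VLM.
vlm = CompactVLM()
opt1 = torch.optim.AdamW(vlm.parameters(), lr=1e-4)
for _ in range(10):  # stands in for passes over the 2.4M-sample corpus
    img = torch.randn(8, 512)
    txt = torch.randn(8, 300)
    rep = vlm(img, txt)
    loss = rep.pow(2).mean()  # placeholder pretraining objective
    opt1.zero_grad(); loss.backward(); opt1.step()

# Stage 2: align manipulation-relevant representations into the action space
# by training an action head on top of the (here frozen) pretrained backbone.
expert = ActionExpert()
opt2 = torch.optim.AdamW(expert.parameters(), lr=1e-4)
for _ in range(10):
    img = torch.randn(8, 512)
    txt = torch.randn(8, 300)
    with torch.no_grad():        # keep the pretrained backbone fixed here
        rep = vlm(img, txt)
    pred = expert(rep)
    target = torch.randn(8, 7)   # stands in for demonstrated robot actions
    loss = nn.functional.mse_loss(pred, target)
    opt2.zero_grad(); loss.backward(); opt2.step()
```

Whether the backbone stays frozen or is fine-tuned in stage 2, and what the stage-1 objective actually is, are design choices the summary does not specify; the sketch only illustrates the pretrain-then-align structure.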