PVI: Plug-in Visual Injection for Vision-Language-Action Models
arXiv cs.CV / 3/16/2026
📰 News · Tools & Practical Usage · Models & Research
Key Points
- PVI is a lightweight, encoder-agnostic plug-in module that attaches to a pretrained vision-language-action policy and injects auxiliary visual representations via zero-initialized residual pathways, preserving pretrained behavior with only single-stage fine-tuning.
- The study finds that temporal video features (V-JEPA2) outperform static image features (DINOv2), with the largest gains on multi-phase tasks that require state tracking and coordination.
- PVI yields consistent gains over the base policy across a range of injection strategies, outperforming alternative ways of integrating the auxiliary features.
- Real-robot experiments on long-horizon bimanual cloth folding validate PVI's practicality beyond simulation and its potential for real-world robotics applications.
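The core mechanism in the first point — injecting auxiliary visual features through a zero-initialized residual pathway so the pretrained policy is untouched at the start of fine-tuning — can be sketched as follows. This is a minimal illustration in plain Python, not the paper's implementation; the class name, the scalar gate, and the projection shape are all assumptions.

```python
class ZeroInitInjector:
    """Hedged sketch of zero-initialized residual injection.

    Projects an auxiliary feature vector (e.g. from a frozen video encoder)
    and adds it to the policy's hidden state through a gate that starts at
    zero, so at initialization the pretrained policy's output is unchanged.
    """

    def __init__(self, dim):
        # Projection weights; a real module would use a learned linear layer.
        self.proj_w = [[0.01] * dim for _ in range(dim)]
        # Zero-initialized gate: the residual branch contributes nothing
        # until fine-tuning moves it away from zero.
        self.gate = 0.0

    def __call__(self, hidden, aux):
        projected = [sum(w * a for w, a in zip(row, aux)) for row in self.proj_w]
        return [h + self.gate * p for h, p in zip(hidden, projected)]
```

Because the gate is zero at initialization, the injector is a no-op on the pretrained features, which is what lets the plug-in attach to an existing policy and be trained in a single stage without first destabilizing its behavior.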
