VLBiMan: Vision-Language Anchored One-Shot Demonstration Enables Generalizable Bimanual Robotic Manipulation
arXiv cs.RO / 5/4/2026
Key Points
- VLBiMan is a vision-language anchored robotic framework that learns generalizable bimanual manipulation skills from a single human demonstration by decomposing tasks into reusable components.
- The method keeps invariant “primitive” skills as anchors while dynamically adapting the adjustable parts through vision-language grounding, avoiding policy retraining when scenes change (a minimal sketch of this anchor-and-adapt idea follows the list).
- It addresses real-world scene ambiguities such as background variation, object repositioning, visual clutter, and external disturbances via semantic parsing and geometric feasibility constraints.
- Experiments show VLBiMan reduces required demonstrations versus imitation-learning baselines, supports compositional generalization through atomic skill splicing, improves robustness to novel but semantically similar objects, and transfers across different robot embodiments without retraining.
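The sketch below illustrates the anchor-and-adapt decomposition described above, under assumptions: an atomic skill keeps its demonstrated motion primitive fixed while scene-dependent slots (which object, and where it currently is) are re-resolved by a vision-language grounding call at execution time, subject to a geometric feasibility check. All class and function names (`AtomicSkill`, `SceneGrounder`, `instantiate`, `reachable`) are illustrative placeholders, not the paper's actual API.

```python
"""Minimal sketch (not the authors' code) of the anchor-and-adapt idea:
invariant motion anchors from one demonstration are re-grounded onto
the current scene via a vision-language query."""

from dataclasses import dataclass
from typing import Callable, List, Tuple

Pose = Tuple[float, float, float]  # simplified 3-DoF pose for illustration


@dataclass
class AtomicSkill:
    """One reusable unit spliced out of the single human demonstration."""
    name: str                       # e.g. "left_grasp", "right_pour"
    arm: str                        # "left" or "right"
    relative_waypoints: List[Pose]  # invariant motion anchor, object-relative
    target_query: str               # natural-language slot, e.g. "the red mug"


@dataclass
class SceneGrounder:
    """Stand-in for the vision-language grounding module: maps a text
    query to an object pose in the current scene (hypothetical interface)."""
    locate: Callable[[str], Pose]

    def resolve(self, query: str) -> Pose:
        return self.locate(query)


def reachable(pose: Pose, arm: str) -> bool:
    """Toy geometric feasibility check: keep each arm on its own half of
    the workspace (a placeholder for the paper's feasibility constraints)."""
    x, _, _ = pose
    return x <= 0.0 if arm == "left" else x >= 0.0


def instantiate(skill: AtomicSkill, grounder: SceneGrounder) -> List[Pose]:
    """Re-anchor the invariant waypoints onto the freshly grounded target,
    so the same demo transfers to repositioned or novel-but-similar objects."""
    target = grounder.resolve(skill.target_query)
    if not reachable(target, skill.arm):
        raise ValueError(f"{skill.name}: target {target} infeasible for {skill.arm} arm")
    return [(target[0] + dx, target[1] + dy, target[2] + dz)
            for dx, dy, dz in skill.relative_waypoints]


if __name__ == "__main__":
    # Splice a two-skill bimanual task from one demonstration.
    task = [
        AtomicSkill("left_grasp", "left", [(0.0, 0.0, 0.10), (0.0, 0.0, 0.0)], "the red mug"),
        AtomicSkill("right_pour", "right", [(0.0, 0.05, 0.15)], "the kettle"),
    ]
    # Fake perception result for a rearranged scene.
    grounder = SceneGrounder(
        locate=lambda q: (-0.2, 0.3, 0.0) if "mug" in q else (0.25, 0.3, 0.0))
    for skill in task:
        print(skill.name, instantiate(skill, grounder))
```

Because only the grounding slots are re-resolved per scene, the same demonstrated waypoints can be reused when objects are repositioned or swapped for semantically similar ones, which is the mechanism the key points attribute to VLBiMan's generalization without retraining.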