FingerViP: Learning Real-World Dexterous Manipulation with Fingertip Visual Perception
arXiv cs.RO / 4/24/2026
Key Points
- FingerViP is a new learning system for dexterous manipulation that replaces a single wrist-mounted view with fingertip visual perception from multiple mini-cameras.
- The approach adds a vision-enhanced fingertip module (with an embedded miniature camera) on each finger to provide multi-view feedback of the hand and surrounding environment, reducing occlusion issues.
- It trains a diffusion-based whole-body visuomotor policy conditioned on a third-person camera view plus multi-view fingertip vision, learning complex skills directly from human demonstrations.
- To better align visual features with proprioception and contact, FingerViP augments fingertip visual inputs with camera pose encoding and per-finger joint-current encodings.
- Experiments on challenging real-world tasks show strong robustness and adaptability, including long-horizon cabinet opening and occluded object retrieval, reaching an 80.8% overall success rate, with hardware and code planned for full open-source release.
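The conditioning described in the last two points can be sketched at the feature level: each fingertip's visual feature is fused with its camera pose encoding and joint-current encoding, then stacked with the third-view feature. This is a minimal shape-level sketch; all names and dimensions (`NUM_FINGERS`, `IMG_FEAT`, `POSE_FEAT`, `CURR_FEAT`, `build_condition`) are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_FINGERS = 5     # hypothetical: one mini-camera per fingertip
IMG_FEAT = 64       # per-view visual feature size (assumed)
POSE_FEAT = 7       # camera pose encoding, e.g. position + quaternion (assumed)
CURR_FEAT = 4       # per-finger joint-current encoding size (assumed)

def build_condition(third_view_feat, fingertip_feats, pose_encs, current_encs):
    """Fuse third-view and per-fingertip features into one conditioning vector.

    Each fingertip's visual feature is concatenated with its camera pose
    encoding and its joint-current encoding, then all fingers are stacked
    alongside the third-view feature (mirroring the summary's description
    of aligning vision with proprioception and contact).
    """
    per_finger = [
        np.concatenate([fingertip_feats[i], pose_encs[i], current_encs[i]])
        for i in range(NUM_FINGERS)
    ]
    return np.concatenate([third_view_feat] + per_finger)

# Dummy features standing in for encoder outputs.
cond = build_condition(
    rng.standard_normal(IMG_FEAT),
    rng.standard_normal((NUM_FINGERS, IMG_FEAT)),
    rng.standard_normal((NUM_FINGERS, POSE_FEAT)),
    rng.standard_normal((NUM_FINGERS, CURR_FEAT)),
)
print(cond.shape)  # (64 + 5 * (64 + 7 + 4),) = (439,)
```

A vector like this would serve as the conditioning input to the diffusion policy's denoising network; the real system presumably uses learned encoders rather than raw concatenation.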