FingerViP: Learning Real-World Dexterous Manipulation with Fingertip Visual Perception

arXiv cs.RO / 4/24/2026


Key Points

  • FingerViP is a new learning system for dexterous manipulation that replaces a single wrist-mounted view with fingertip visual perception from multiple mini-cameras.
  • The approach adds a vision-enhanced fingertip module (with an embedded miniature camera) on each finger to provide multi-view feedback of the hand and surrounding environment, reducing occlusion issues.
  • It trains a diffusion-based whole-body visuomotor policy conditioned on a third-view camera plus multi-view fingertip vision, learning complex skills directly from human demonstrations.
  • To better align visual features with proprioception and contact, FingerViP augments fingertip visual inputs with camera pose encoding and per-finger joint-current encodings.
  • Experiments on challenging real-world tasks show strong robustness and adaptability, including long-horizon cabinet opening and occluded object retrieval, reaching an 80.8% overall success rate, with hardware and code planned for full open-source release.
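The fourth point above describes a concrete mechanism: each fingertip camera's visual feature is concatenated with an encoding of that camera's pose and an encoding of the finger's joint currents before conditioning the policy. A minimal sketch of that augmentation step is below; the function name, dimensions, and the use of fixed random linear projections as stand-in encoders are all illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def encode_fingertip_token(visual_feat, cam_pose, joint_current,
                           d_pose=16, d_curr=8, seed=0):
    """Illustrative sketch: augment one fingertip camera's visual feature
    with a camera-pose encoding and a per-finger joint-current encoding,
    in the spirit of FingerViP's conditioning. The linear 'encoders' here
    are fixed random projections standing in for learned modules."""
    rng = np.random.default_rng(seed)
    W_pose = rng.standard_normal((cam_pose.size, d_pose))      # assumed encoder
    W_curr = rng.standard_normal((joint_current.size, d_curr)) # assumed encoder
    pose_enc = cam_pose @ W_pose          # camera pose encoding
    curr_enc = joint_current @ W_curr     # joint-current (contact) encoding
    # The augmented token is what conditions the visuomotor policy.
    return np.concatenate([visual_feat, pose_enc, curr_enc])

# Example: a 32-d visual feature, 6-d camera pose, 3 joint currents.
token = encode_fingertip_token(np.ones(32), np.ones(6), np.ones(3))
```

One such token would be produced per finger, giving the policy both what each camera sees and where it is looking from.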

Abstract

The current practice of dexterous manipulation generally relies on a single wrist-mounted view, which is often occluded and limits performance on tasks requiring multi-view perception. In this work, we present FingerViP, a learning system that utilizes a visuomotor policy with fingertip visual perception for dexterous manipulation. Specifically, we design a vision-enhanced fingertip module with an embedded miniature camera and install the modules on each finger of a multi-fingered hand. The fingertip cameras substantially improve visual perception by providing comprehensive, multi-view feedback of both the hand and its surrounding environment. Building on the integrated fingertip modules, we develop a diffusion-based whole-body visuomotor policy conditioned on a third-view camera and multi-view fingertip vision, which effectively learns complex manipulation skills directly from human demonstrations. To improve view-proprioception alignment and contact awareness, each fingertip visual feature is augmented with its corresponding camera pose encoding and per-finger joint-current encoding. We validate the effectiveness of the multi-view fingertip vision and demonstrate the robustness and adaptability of FingerViP on various challenging real-world tasks, including pressing buttons inside a confined box, retrieving sticks from an unstable support, retrieving objects behind an occluding curtain, and performing long-horizon cabinet opening and object retrieval, achieving an overall success rate of 80.8%. All hardware designs and code will be fully open-sourced.
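The abstract's diffusion-based visuomotor policy generates actions by iteratively denoising, with every denoising step conditioned on the fused observation features (third-view camera plus fingertip tokens). The toy DDPM-style reverse loop below sketches that sampling procedure under stated assumptions: `eps_model` is a hypothetical noise-prediction network, and the noise schedule, horizon, and action dimension are placeholders, not values from the paper.

```python
import numpy as np

def sample_action_chunk(eps_model, obs_cond, horizon=8, act_dim=7,
                        steps=10, seed=0):
    """Toy DDPM-style reverse process: start from Gaussian noise and
    iteratively denoise a chunk of actions, conditioning each step on
    the observation features. `eps_model(x_t, t, obs_cond)` is assumed
    to predict the noise added at step t; the schedule is illustrative."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.2, steps)   # toy variance schedule
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    x = rng.standard_normal((horizon, act_dim))  # pure noise to start
    for t in reversed(range(steps)):
        eps = eps_model(x, t, obs_cond)          # conditioned noise estimate
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            x += np.sqrt(betas[t]) * rng.standard_normal(x.shape)
    return x  # denoised action chunk

# Usage with a dummy predictor (a real system would use a trained network).
def zero_eps(x, t, cond):
    return np.zeros_like(x)

actions = sample_action_chunk(zero_eps, obs_cond=np.zeros(4))
```

In imitation-learned diffusion policies of this kind, the network is trained to predict the noise added to demonstrated action chunks, which is how the skills are learned "directly from human demonstrations."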