AI Navigate

Gesture-Aware Pretraining and Token Fusion for 3D Hand Pose Estimation

arXiv cs.CV / 3/19/2026

📰 News · Models & Research

Key Points

  • The paper proposes a gesture-aware pretraining framework for 3D hand pose estimation from monocular RGB images, leveraging gesture labels to provide a useful inductive bias.
  • It presents a two-stage pipeline consisting of gesture-aware pretraining to learn an informative embedding space from coarse and fine gesture labels, followed by a per-joint token Transformer that uses gesture embeddings to regress MANO hand parameters.
  • A layered training objective supervises MANO parameters, 3D joint positions, and structural constraints jointly.
  • Experiments on InterHand2.6M show that gesture-aware pretraining improves single-hand accuracy over the prior EANet baseline and generalizes across architectures without modification.
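The layered objective described above can be sketched as a weighted sum of a parameter term, a joint-position term, and a structural term. The following is a minimal illustrative sketch, not the authors' implementation: the loss weights, the choice of L2 vs. L1 per term, and the bone-length structural constraint are all assumptions.

```python
import numpy as np

def layered_loss(pred_params, gt_params,
                 pred_joints, gt_joints,
                 bone_pairs,
                 w_param=1.0, w_joint=1.0, w_struct=0.1):
    """Hypothetical layered objective: parameters + joints + structure.

    Weights and term definitions are illustrative assumptions, not the
    paper's exact formulation.
    """
    # L2 loss on predicted MANO pose/shape parameters
    l_param = np.mean((pred_params - gt_params) ** 2)
    # L1 loss on 3D joint positions, shape (J, 3)
    l_joint = np.mean(np.abs(pred_joints - gt_joints))

    # Structural constraint (assumed): match bone lengths between
    # connected joint pairs, which penalizes implausible hand skeletons
    def bone_lengths(joints):
        return np.array([np.linalg.norm(joints[a] - joints[b])
                         for a, b in bone_pairs])

    l_struct = np.mean(np.abs(bone_lengths(pred_joints)
                              - bone_lengths(gt_joints)))
    return w_param * l_param + w_joint * l_joint + w_struct * l_struct
```

A perfect prediction drives all three terms to zero; translating the whole hand leaves the bone-length term untouched while the joint term rises, which is the usual motivation for mixing absolute and structural supervision.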

Abstract

Estimating 3D hand pose from monocular RGB images is fundamental for applications in AR/VR, human-computer interaction, and sign language understanding. In this work, we focus on a scenario where a discrete set of gesture labels is available and show that gesture semantics can serve as a powerful inductive bias for 3D pose estimation. We present a two-stage framework: gesture-aware pretraining that learns an informative embedding space using coarse and fine gesture labels from InterHand2.6M, followed by a per-joint token Transformer that uses gesture embeddings as intermediate representations to regress the final MANO hand parameters. Training is driven by a layered objective over parameters, joints, and structural constraints. Experiments on InterHand2.6M demonstrate that gesture-aware pretraining consistently improves single-hand accuracy over the state-of-the-art EANet baseline, and that the benefit transfers across architectures without any modification.
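The gesture-to-token fusion step can be illustrated with a toy NumPy sketch: a gesture embedding is broadcast onto per-joint tokens, one self-attention pass mixes the conditioned tokens, and a linear head regresses a MANO parameter vector. All dimensions, the fusion-by-addition choice, and the single-layer design are assumptions for illustration, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
J, D, N_MANO = 21, 32, 61  # joints, token dim, MANO param count (assumed)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_and_regress(joint_tokens, gesture_emb, Wq, Wk, Wv, Wout):
    """Hypothetical gesture-conditioned per-joint attention + regression."""
    # Fusion (assumed): broadcast-add the gesture embedding to each token
    tokens = joint_tokens + gesture_emb[None, :]       # (J, D)
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv    # (J, D) each
    attn = softmax(q @ k.T / np.sqrt(D))               # (J, J) attention map
    ctx = attn @ v                                     # (J, D) mixed tokens
    # Pool joint context and regress MANO parameters with a linear head
    return ctx.mean(axis=0) @ Wout                     # (N_MANO,)

joint_tokens = rng.standard_normal((J, D))
gesture_emb = rng.standard_normal(D)
Wq, Wk, Wv = (rng.standard_normal((D, D)) * 0.1 for _ in range(3))
Wout = rng.standard_normal((D, N_MANO)) * 0.1
mano_params = fuse_and_regress(joint_tokens, gesture_emb, Wq, Wk, Wv, Wout)
```

Because the gesture embedding enters every joint token before attention, the same Transformer weights can condition on different gestures without architectural changes, which is consistent with the paper's claim that the benefit transfers across architectures.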