Gesture-Aware Pretraining and Token Fusion for 3D Hand Pose Estimation
arXiv cs.CV · March 19, 2026
Key Points
- The paper proposes a gesture-aware pretraining framework for 3D hand pose estimation from monocular RGB images, leveraging gesture labels to provide a useful inductive bias.
- It presents a two-stage pipeline: gesture-aware pretraining first learns an informative embedding space from coarse and fine gesture labels, and a per-joint token Transformer then fuses those gesture embeddings into its tokens to regress MANO hand parameters.
- The training objective is layered, supervising parameters, joints, and structural constraints to guide learning.
- Experiments on InterHand2.6M show that gesture-aware pretraining improves single-hand accuracy over the prior EANet baseline and generalizes across architectures without modification.
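The per-joint token fusion described above can be sketched minimally. The snippet below is an illustrative NumPy toy, not the paper's implementation: the token dimension, the additive fusion of the gesture embedding into every joint token, and the single attention layer are all assumptions chosen for brevity. MANO's 48 pose plus 10 shape parameters and the 21-joint hand skeleton are standard.

```python
import numpy as np

rng = np.random.default_rng(0)

N_JOINTS = 21        # per-joint tokens (standard 21-joint hand skeleton)
D = 64               # token dimension (illustrative choice)
N_MANO = 48 + 10     # MANO pose (48) + shape (10) parameters

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, Wq, Wk, Wv):
    # single-head scaled dot-product attention over the joint tokens
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return attn @ v

# hypothetical inputs: per-joint image features and a gesture embedding
# (in the paper the embedding comes from gesture-aware pretraining)
joint_feats = rng.standard_normal((N_JOINTS, D))
gesture_emb = rng.standard_normal(D)

# fusion step: add the gesture embedding to every joint token
tokens = joint_feats + gesture_emb

Wq, Wk, Wv = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))
fused = self_attention(tokens, Wq, Wk, Wv)

# regress MANO parameters from the pooled token representation
W_head = rng.standard_normal((D, N_MANO)) / np.sqrt(D)
mano_params = fused.mean(axis=0) @ W_head
print(mano_params.shape)  # (58,)
```

A real model would use learned projections, multiple Transformer layers, and learned fusion (e.g. concatenation or cross-attention); the sketch only shows where the gesture embedding enters the token stream.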
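The layered objective (parameters, joints, structural constraints) can likewise be sketched. This is a hedged toy version: the choice of L2 on parameters, Euclidean error on joints, a bone-length term as the structural constraint, and the loss weights are all assumptions; the paper's exact terms and weights are not specified here.

```python
import numpy as np

def layered_loss(pred_params, gt_params, pred_joints, gt_joints, bone_pairs):
    # level 1: parameter supervision (MANO pose/shape vectors)
    l_param = np.mean((pred_params - gt_params) ** 2)
    # level 2: joint supervision (mean 3D keypoint error)
    l_joint = np.mean(np.linalg.norm(pred_joints - gt_joints, axis=-1))
    # level 3: structural constraint, here a bone-length consistency term
    def bone_lengths(joints):
        return np.array([np.linalg.norm(joints[a] - joints[b])
                         for a, b in bone_pairs])
    l_bone = np.mean(np.abs(bone_lengths(pred_joints) - bone_lengths(gt_joints)))
    # illustrative weights only
    return l_param + l_joint + 0.5 * l_bone

rng = np.random.default_rng(1)
gt_p, gt_j = rng.standard_normal(58), rng.standard_normal((21, 3))
bones = [(0, 1), (1, 2), (2, 3)]  # a few hypothetical parent-child joint pairs

perfect = layered_loss(gt_p, gt_p, gt_j, gt_j, bones)
noisy = layered_loss(gt_p + 0.1, gt_p, gt_j + 0.1, gt_j, bones)
print(perfect, noisy)  # perfect prediction gives 0.0
```

Supervising at all three levels keeps the parameter regression anchored while the joint and structural terms penalize physically implausible hands that parameter error alone might miss.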
