Fast-HaMeR: Boosting Hand Mesh Reconstruction using Knowledge Distillation

arXiv cs.CV / 3/18/2026

Key Points

Fast-HaMeR speeds up 3D hand mesh reconstruction by combining lightweight neural backbones with knowledge distillation, targeting real-time performance on low-power devices while keeping accuracy comparable.
The method substitutes the ViT-H backbone in HaMeR with lightweight backbones such as MobileNet, MobileViT, ConvNeXt, and ResNet to reduce model size.
It evaluates three distillation strategies (output-level, feature-level, and a hybrid of both): output-level distillation notably improves student performance, while feature-level distillation is more effective for higher-capacity students.
With backbones about 35% the size of the original, the experiments show about 1.5x faster inference with an accuracy difference of only about 0.4 mm.
The work emphasizes practical deployment in VR/AR, HCI, robotics, and healthcare, and the code and models are released on GitHub.
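
The three distillation strategies named above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the use of mean squared error, the linear projection that maps student features to the teacher's feature size, and the mixing weight alpha are all hypothetical choices made here for clarity.

```python
# Hypothetical sketch: names, the squared-error choice, the projection, and
# alpha are assumptions for illustration, not taken from the Fast-HaMeR code.


def mse(a, b):
    """Mean squared error between two equal-length lists of floats."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def project(features, weights):
    """Linear map from student feature space to teacher feature space.

    weights holds one row per teacher dimension and one column per
    student dimension.
    """
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


def output_loss(student_out, teacher_out):
    """Output-level term: student predictions mimic teacher predictions."""
    return mse(student_out, teacher_out)


def feature_loss(student_feat, teacher_feat, weights):
    """Feature-level term: projected student features mimic teacher features."""
    return mse(project(student_feat, weights), teacher_feat)


def hybrid_loss(student_out, teacher_out, student_feat, teacher_feat,
                weights, alpha=0.5):
    """Weighted sum of both terms; alpha is an assumed mixing weight."""
    return (alpha * output_loss(student_out, teacher_out)
            + (1.0 - alpha) * feature_loss(student_feat, teacher_feat, weights))
```

In a real training loop the teacher would be frozen, the projection weights would be learned together with the student, and a term like this would typically be added to the usual supervised loss.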

Abstract

Fast and accurate 3D hand reconstruction is essential for real-time applications in VR/AR, human-computer interaction, robotics, and healthcare. Most state-of-the-art methods rely on heavy models, limiting their use on resource-constrained devices like headsets, smartphones, and embedded systems. In this paper, we investigate how the use of lightweight neural networks, combined with Knowledge Distillation, can accelerate complex 3D hand reconstruction models by making them faster and lighter, while maintaining comparable reconstruction accuracy. While our approach is suited for various hand reconstruction frameworks, we focus primarily on boosting the HaMeR model, currently the leading method in terms of reconstruction accuracy. We replace its original ViT-H backbone with lighter alternatives, including MobileNet, MobileViT, ConvNeXt, and ResNet, and evaluate three knowledge distillation strategies: output-level, feature-level, and a hybrid of both. Our experiments show that using lightweight backbones that are only 35% the size of the original achieves 1.5x faster inference speed while preserving similar performance quality with only a minimal accuracy difference of 0.4mm. More specifically, we show how output-level distillation notably improves student performance, while feature-level distillation proves more effective for higher-capacity students. Overall, the findings pave the way for efficient real-world applications on low-power devices. The code and models are publicly available under https://github.com/hunainahmedj/Fast-HaMeR.
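
The abstract reports the accuracy gap in millimetres but, as quoted here, does not name the metric. As a rough sketch, assuming the figure is a mean per-joint distance (a common choice in hand pose evaluation), the gap between a student and the teacher could be computed as below. The function names and the toy coordinates are hypothetical and are not results from the paper.

```python
import math


def mean_joint_error_mm(pred, target):
    """Mean Euclidean distance between matching 3D joints, in millimetres.

    pred and target are equal-length sequences of (x, y, z) tuples given
    in millimetres. The metric choice is an assumption, not from the paper.
    """
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equal length")
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(pred)


def accuracy_gap_mm(student_pred, teacher_pred, target):
    """Student error minus teacher error; positive means the student is worse."""
    return (mean_joint_error_mm(student_pred, target)
            - mean_joint_error_mm(teacher_pred, target))


# Toy values for illustration only.
ground_truth = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
teacher_joints = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
student_joints = [(0.0, 0.0, 0.0), (13.0, 4.0, 0.0)]
gap = accuracy_gap_mm(student_joints, teacher_joints, ground_truth)
```

Published hand reconstruction benchmarks often align predictions to the ground truth before measuring distance; that alignment step is omitted here to keep the sketch short.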
