Teacher-Student Diffusion Model for Text-Driven 3D Hand Motion Generation

arXiv cs.CV / 3/26/2026


Key Points

  • The paper introduces TSHaMo, a teacher-student diffusion framework that generates realistic 3D hand motions from natural language text without requiring 3D meshes at inference time.
  • The teacher uses structured auxiliary signals such as MANO parameters to guide training, while the student learns to generate motion from text-only inputs.
  • A co-training strategy lets the student benefit from the teacher’s intermediate predictions, aiming to improve both motion quality and diversity.
  • Experiments on the GRAB and H2O datasets, using two diffusion backbones, show consistent improvements over prior approaches, with ablations demonstrating robustness to different auxiliary inputs.
  • The method is described as model-agnostic and flexible, enabling integration of varied training-time auxiliary signals while preserving text-only deployment.

Abstract

Generating realistic 3D hand motion from natural language is vital for VR, robotics, and human-computer interaction. Existing methods either focus on full-body motion, overlooking detailed hand gestures, or require explicit 3D object meshes, limiting generality. We propose TSHaMo, a model-agnostic teacher-student diffusion framework for text-driven hand motion generation. The student model learns to synthesize motions from text alone, while the teacher leverages auxiliary signals (e.g., MANO parameters) to provide structured guidance during training. A co-training strategy enables the student to benefit from the teacher's intermediate predictions while remaining text-only at inference. Evaluated using two diffusion backbones on GRAB and H2O, TSHaMo consistently improves motion quality and diversity. Ablations confirm its robustness and flexibility in using diverse auxiliary inputs without requiring 3D objects at test time.
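The co-training scheme described above can be sketched with a deliberately tiny toy: a teacher that conditions on text plus auxiliary features (standing in for MANO parameters) and a text-only student trained against a mix of the ground-truth target and the teacher's prediction. Everything here — the linear "denoisers", the dimensions, the `distill_weight` mixing, and the single-step training loop — is a hypothetical illustration of the general idea, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical): text embedding, auxiliary (e.g. MANO) features, motion frame.
D_TEXT, D_AUX, D_MOTION = 8, 6, 4

# Linear maps standing in for the diffusion denoising backbones.
W_teacher = rng.normal(size=(D_TEXT + D_AUX, D_MOTION)) * 0.1
W_student = rng.normal(size=(D_TEXT, D_MOTION)) * 0.1

def teacher_denoise(text, aux):
    # Teacher sees text AND auxiliary signals (training time only).
    return np.concatenate([text, aux]) @ W_teacher

def student_denoise(text):
    # Student is text-only, so deployment needs no 3D assets.
    return text @ W_student

def co_training_step(text, aux, x0, lr=0.02, distill_weight=0.5):
    """One gradient step: the student is pulled toward a blend of the
    ground-truth motion x0 and the teacher's intermediate prediction."""
    global W_student
    t_pred = teacher_denoise(text, aux)
    s_pred = student_denoise(text)
    target = (1 - distill_weight) * x0 + distill_weight * t_pred
    # Gradient of 0.5 * ||s_pred - target||^2 w.r.t. W_student.
    grad = np.outer(text, s_pred - target)
    W_student -= lr * grad
    return float(np.mean((s_pred - target) ** 2))

# Toy data: one text embedding, its auxiliary features, and a target motion frame.
text = rng.normal(size=D_TEXT)
aux = rng.normal(size=D_AUX)
x0 = rng.normal(size=D_MOTION)
losses = [co_training_step(text, aux, x0) for _ in range(100)]
```

In this sketch the student's loss on the blended target decreases as training proceeds, which is the intended effect of the distillation term; in the actual framework the same idea is applied inside the diffusion training loop across noise levels.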
