
Toward Phonology-Guided Sign Language Motion Generation: A Diffusion Baseline and Conditioning Analysis

arXiv cs.CV / 3/19/2026


Key Points

  • The paper establishes a strong diffusion baseline for 3D avatar sign language motion generation using an MDM-style diffusion model with SMPL-X representation, outperforming SignAvatar on gloss discriminability metrics.
  • It systematically studies the effect of text conditioning across different text encoders (CLIP vs. T5), conditioning modes (gloss-only vs. gloss+phonological attributes), and attribute notation formats (symbolic vs. natural language).
  • It finds that translating symbolic ASL-LEX notations into natural language is necessary for effective CLIP-based attribute conditioning, while T5 is largely unaffected by this translation (a minimal mapping sketch follows this list).
  • The best-performing variant (CLIP with mapped attributes) outperforms SignAvatar across all metrics, highlighting input representation as a critical factor and motivating independent conditioning pathways for gloss and phonological attributes.
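
To make the symbolic-to-natural-language step concrete, here is a minimal Python sketch. The attribute names, code values, and phrasings below are hypothetical stand-ins for ASL-LEX 2.0 fields, not the paper's actual mapping.

```python
# Hypothetical sketch: translate symbolic ASL-LEX-style attribute codes
# into a natural-language prompt for a text encoder such as CLIP.
# Field names and phrasings are illustrative only.

ATTRIBUTE_TEMPLATES = {
    "Handshape": "the dominant hand uses the '{}' handshape",
    "MinorLocation": "the sign is produced at the {}",
    "Movement": "the hands move with a {} movement",
}

def attributes_to_prompt(gloss, attributes):
    """Build one natural-language prompt from a gloss and attribute codes."""
    phrases = [
        ATTRIBUTE_TEMPLATES[name].format(value)
        for name, value in attributes.items()
        if name in ATTRIBUTE_TEMPLATES
    ]
    return f"a person signs '{gloss}', where " + ", ".join(phrases)

# Example with made-up attribute values:
print(attributes_to_prompt(
    "BOOK",
    {"Handshape": "open-B", "MinorLocation": "neutral space", "Movement": "hinge"},
))
# -> a person signs 'BOOK', where the dominant hand uses the 'open-B'
#    handshape, the sign is produced at the neutral space, the hands move
#    with a hinge movement
```

The intuition, per the paper's finding, is that CLIP's text encoder was trained on natural-language captions, so prose like the output above is closer to its training distribution than raw symbolic codes.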

Abstract

Generating natural, correct, and visually smooth 3D avatar sign language motion conditioned on text inputs remains highly challenging. In this work, we train a generative model of 3D body motion and explore the role of phonological attribute conditioning for sign language motion generation, using ASL-LEX 2.0 annotations such as handshape, hand location, and movement. We first establish a strong diffusion baseline using a Human Motion Diffusion Model (MDM)-style diffusion model with SMPL-X representation, which outperforms SignAvatar, a state-of-the-art CVAE method, on gloss discriminability metrics. We then systematically study the role of text conditioning using different text encoders (CLIP vs. T5), conditioning modes (gloss-only vs. gloss+phonological attributes), and attribute notation formats (symbolic vs. natural language). Our analysis reveals that translating symbolic ASL-LEX notations into natural language is a necessary condition for effective CLIP-based attribute conditioning, while T5 is largely unaffected by this translation. Furthermore, our best-performing variant (CLIP with mapped attributes) outperforms SignAvatar across all metrics. These findings highlight input representation as a critical factor for text-encoder-based attribute conditioning, and motivate structured conditioning approaches where gloss and phonological attributes are encoded through independent pathways.
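
The "independent pathways" idea the abstract closes on can be illustrated with a minimal PyTorch sketch. This is not the paper's architecture; the class name, additive fusion, and all dimensions are assumptions chosen for clarity.

```python
import torch
import torch.nn as nn

class DualPathwayConditioner(nn.Module):
    """Hypothetical sketch: project gloss and phonological-attribute
    embeddings through separate pathways, then fuse them into a single
    conditioning token for an MDM-style diffusion denoiser.
    All sizes and the additive fusion are assumptions."""

    def __init__(self, text_dim=512, cond_dim=512):
        super().__init__()
        self.gloss_proj = nn.Linear(text_dim, cond_dim)  # pathway for the gloss embedding
        self.attr_proj = nn.Linear(text_dim, cond_dim)   # pathway for the attribute embedding

    def forward(self, gloss_emb, attr_emb):
        # Each stream is projected independently; summing keeps a single
        # conditioning token, as an MDM-style transformer denoiser expects.
        return self.gloss_proj(gloss_emb) + self.attr_proj(attr_emb)

# Example with frozen text-encoder outputs (e.g., CLIP pooled embeddings):
gloss_emb = torch.randn(2, 512)  # batch of gloss embeddings
attr_emb = torch.randn(2, 512)   # batch of attribute-description embeddings
cond = DualPathwayConditioner()(gloss_emb, attr_emb)
print(cond.shape)  # torch.Size([2, 512])
```

Keeping the two projections separate means a degraded or missing attribute description cannot corrupt the gloss signal, which is one plausible reading of why the paper advocates structured conditioning over a single concatenated prompt.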