Toward Phonology-Guided Sign Language Motion Generation: A Diffusion Baseline and Conditioning Analysis
arXiv cs.CV / 3/19/2026
Key Points
- The paper establishes a strong diffusion baseline for 3D avatar sign language motion generation, using an MDM-style diffusion model over an SMPL-X body representation and outperforming SignAvatar on gloss discriminability metrics.
- It systematically studies the impact of text conditioning across encoders (CLIP vs. T5), conditioning modes (gloss-only vs. gloss plus phonological attributes), and attribute notation formats (symbolic vs. natural language).
- It finds that translating symbolic ASL-LEX notation into natural-language descriptions is necessary for effective CLIP-based attribute conditioning, while T5 is largely unaffected by this translation (see the mapping sketch after this list).
- The best-performing variant (CLIP with mapped attributes) outperforms SignAvatar across all metrics, highlighting the importance of input representation and of independent conditioning pathways for gloss and phonological attributes (see the conditioning sketch after this list).
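
To make the notation-format comparison concrete, here is a minimal sketch of how symbolic phonological codes could be mapped to a natural-language description before text encoding. The attribute names and code values below are illustrative placeholders, not ASL-LEX's actual inventory or the paper's exact mapping.

```python
# Hypothetical sketch: converting symbolic phonological codes into a
# natural-language sentence before feeding them to a text encoder.
# Attribute names and code values are illustrative, not ASL-LEX's real inventory.

SYMBOLIC_TO_TEXT = {
    "handshape": {"1": "an index-finger handshape", "5": "an open-hand handshape"},
    "location": {"head": "produced near the head", "torso": "produced in front of the torso"},
    "movement": {"circ": "moving in a circle", "str": "moving in a straight path"},
}

def attributes_to_sentence(attrs: dict) -> str:
    """Map a dict of symbolic phonological codes to a plain-English description."""
    parts = [SYMBOLIC_TO_TEXT[k].get(v, v) for k, v in attrs.items() if k in SYMBOLIC_TO_TEXT]
    return "A sign with " + ", ".join(parts) + "."

print(attributes_to_sentence({"handshape": "5", "location": "head", "movement": "circ"}))
# -> "A sign with an open-hand handshape, produced near the head, moving in a circle."
```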
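
And a minimal PyTorch sketch of the independent-pathway conditioning idea: gloss and attribute text features are projected through separate linear pathways and prepended as conditioning tokens to an MDM-style transformer denoiser over SMPL-X pose sequences. The class name, dimensions, and layer choices are assumptions for illustration, and the random embeddings stand in for pooled CLIP or T5 text features; this is not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class CondMDM(nn.Module):
    """Illustrative MDM-style denoiser with separate gloss and attribute pathways."""
    def __init__(self, pose_dim=165, latent_dim=512, gloss_dim=512, attr_dim=512):
        super().__init__()
        # Independent pathways: gloss and phonological attributes each get
        # their own projection and their own conditioning token.
        self.gloss_proj = nn.Linear(gloss_dim, latent_dim)
        self.attr_proj = nn.Linear(attr_dim, latent_dim)
        self.time_embed = nn.Sequential(
            nn.Linear(1, latent_dim), nn.SiLU(), nn.Linear(latent_dim, latent_dim))
        self.pose_in = nn.Linear(pose_dim, latent_dim)
        layer = nn.TransformerEncoderLayer(d_model=latent_dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.pose_out = nn.Linear(latent_dim, pose_dim)

    def forward(self, noisy_motion, t, gloss_emb, attr_emb):
        # noisy_motion: (B, T, pose_dim) SMPL-X pose sequence at diffusion step t
        b = noisy_motion.shape[0]
        cond = torch.stack([
            self.time_embed(t.view(b, 1).float()),  # diffusion timestep token
            self.gloss_proj(gloss_emb),             # gloss pathway token
            self.attr_proj(attr_emb),               # phonological-attribute pathway token
        ], dim=1)                                   # (B, 3, latent_dim)
        x = self.pose_in(noisy_motion)              # (B, T, latent_dim)
        h = self.backbone(torch.cat([cond, x], dim=1))
        return self.pose_out(h[:, cond.shape[1]:])  # predicted clean motion, (B, T, pose_dim)

# Random tensors standing in for pooled text features and noisy poses.
model = CondMDM()
x_t = torch.randn(2, 60, 165)            # 60-frame SMPL-X pose sequences (55 joints x 3)
t = torch.randint(0, 1000, (2,))
out = model(x_t, t, torch.randn(2, 512), torch.randn(2, 512))
print(out.shape)                          # torch.Size([2, 60, 165])
```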