Targeted Linguistic Analysis of Sign Language Models with Minimal Translation Pairs

arXiv cs.CL / 5/1/2026


Key Points

  • The paper introduces ASL-MTP (American Sign Language Minimal Translation Pairs), a new benchmark dataset designed to test whether sign language models capture specific linguistic phenomena using minimal translation pairs.
  • Using ASL-MTP, the authors perform a targeted analysis of a state-of-the-art ASL-to-English translation model by ablating different input cues during both training and inference.
  • The findings indicate that the model performs above chance on most of the linguistic phenomena, but that it depends heavily on manual (hand-related) cues.
  • The model frequently fails to capture or use crucial non-manual cues, such as those involving the upper body and facial expressions.
  • Overall, the benchmark and analysis approach provide a more precise way to evaluate multimodal understanding in sign language models beyond generic translation/recognition metrics.
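The minimal-pair evaluation described above can be sketched as a forced-choice test: for each item, the model scores the correct translation against a minimal-pair distractor, and accuracy above 0.5 indicates above-chance sensitivity to that phenomenon. The sketch below is illustrative only; the function names and toy scorer are hypothetical and not part of the ASL-MTP release.

```python
def minimal_pair_accuracy(pairs, score):
    """Fraction of items where the model prefers the correct translation.

    pairs: list of (video_features, correct_text, distractor_text) tuples.
    score(video_features, text): higher means the model prefers that text.
    """
    n_correct = sum(
        1 for video, good, bad in pairs
        if score(video, good) > score(video, bad)
    )
    return n_correct / len(pairs)


# Toy stand-in scorer (hypothetical): counts overlap between mock
# "video features" (here, just a bag of glosses) and the candidate text.
def toy_score(video, text):
    return len(set(video) & set(text.split()))


pairs = [
    (["dog", "ran"], "the dog ran", "the dog sat"),
    (["she", "happy"], "she is happy", "she is sad"),
]
print(minimal_pair_accuracy(pairs, toy_score))  # 1.0
```

A real scorer would use the translation model's log-likelihood of each candidate given the signing video; the comparison logic stays the same.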

Abstract

Models of sign language have historically lagged behind those for spoken language (text and speech). Recent work has greatly improved their performance on tasks like sign language translation and isolated sign recognition. However, it remains unclear to what extent existing models capture various linguistic phenomena of sign language, and how well they use cues from the multiple articulators used in sign language (hands, upper body, face). We introduce a new benchmark dataset for American Sign Language, ASL Minimal Translation Pairs (ASL-MTP), divided into multiple types of sign language phenomena and corresponding minimal pairs of translations, for performing such linguistic analyses. As a case study, we use ASL-MTP to analyze a state-of-the-art ASL-to-English translation model. We conduct a targeted analysis of the model by ablating various input cues during training and inference and evaluating on the phenomena in ASL-MTP. Our results show that, while the model performs above chance level on most of the phenomena, it relies strongly on manual cues while often missing crucial non-manual cues.