SignDATA: Data Pipeline for Sign Language Translation

arXiv cs.CL / April 23, 2026


Key Points

  • SignDATA addresses the difficulty of consistently preprocessing sign-language datasets by standardizing inputs that differ in annotation schemas, timing, framing, and privacy constraints.
  • The config-driven toolkit provides two end-to-end pipelines (pose-based and video-based) to convert raw sign-language videos into training-ready pose artifacts or signer-cropped video packages.
  • It supports interchangeable MediaPipe and MMPose backends via a common interface, using typed job schemas, experiment-level overrides, and per-stage checkpointing.
  • The approach emphasizes reproducibility and explicit control over normalization policies and privacy tradeoffs, validated through backend comparisons and preprocessing ablation experiments.
  • The authors release the code publicly, aiming to make sign-language preprocessing less fragmented and empirically comparable across studies.

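To illustrate the "interchangeable backends behind a common interface" idea from the key points, here is a minimal sketch in Python. All class and function names are hypothetical; SignDATA's actual API may differ, and the real backends would invoke MediaPipe or MMPose models rather than the placeholders shown:

```python
from abc import ABC, abstractmethod


class PoseBackend(ABC):
    """Hypothetical common interface for pose-landmark extractors."""

    @abstractmethod
    def extract(self, frame):
        """Return a list of (x, y, confidence) landmarks for one frame."""


class MediaPipeBackend(PoseBackend):
    def extract(self, frame):
        # Placeholder: a real implementation would run MediaPipe here.
        return [(0.5, 0.5, 0.9)]


class MMPoseBackend(PoseBackend):
    def extract(self, frame):
        # Placeholder: a real implementation would run an MMPose model here.
        return [(0.5, 0.5, 0.8)]


# A registry lets a config string select the extractor, so the rest of the
# pipeline never depends on which backend produced the landmarks.
BACKENDS = {"mediapipe": MediaPipeBackend, "mmpose": MMPoseBackend}


def get_backend(name: str) -> PoseBackend:
    return BACKENDS[name]()
```

The design point is that swapping `"mediapipe"` for `"mmpose"` in a config changes only the registry lookup, which is what makes backend comparisons like those reported in the paper cheap to run.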
Abstract

Sign-language datasets are difficult to preprocess consistently because they vary in annotation schema, clip timing, signer framing, and privacy constraints. Existing work usually reports downstream models, while the preprocessing pipeline that converts raw video into training-ready pose or video artifacts remains fragmented, backend-specific, and weakly documented. We present SignDATA, a config-driven preprocessing toolkit that standardizes heterogeneous sign-language corpora into comparable outputs for learning. The system supports two end-to-end recipes: a pose recipe that performs acquisition, manifesting, person localization, clipping, cropping, landmark extraction, normalization, and WebDataset export, and a video recipe that replaces pose extraction with signer-cropped video packaging. SignDATA exposes interchangeable MediaPipe and MMPose backends behind a common interface, typed job schemas, experiment-level overrides, and per-stage checkpointing with config- and manifest-aware hashes. We validate the toolkit through a research-oriented evaluation design centered on backend comparison, preprocessing ablations, and privacy-aware video generation on datasets. Our contribution is a reproducible preprocessing layer for sign-language research that makes extractor choice, normalization policy, and privacy tradeoffs explicit, configurable, and empirically comparable. Code is available at https://github.com/balaboom123/signdata-slt.
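The abstract's "per-stage checkpointing with config- and manifest-aware hashes" can be sketched as a cache key that changes whenever a stage's configuration or input manifest changes. The function below is a hypothetical illustration of that idea, not SignDATA's actual implementation:

```python
import hashlib
import json


def stage_hash(stage_name: str, config: dict, manifest_entries: list) -> str:
    """Hypothetical per-stage cache key.

    If neither the stage's config nor its input manifest has changed,
    the key is identical and the stage's cached output can be reused;
    any change produces a new key and forces recomputation.
    """
    payload = json.dumps(
        {"stage": stage_name, "config": config, "manifest": manifest_entries},
        sort_keys=True,  # stable serialization so dict ordering cannot change the hash
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


# Same inputs -> same key; changing a config value -> different key.
h_same_a = stage_hash("crop", {"size": 512}, ["clip_001.mp4"])
h_same_b = stage_hash("crop", {"size": 512}, ["clip_001.mp4"])
h_changed = stage_hash("crop", {"size": 256}, ["clip_001.mp4"])
assert h_same_a == h_same_b and h_same_a != h_changed
```

Tying the key to both the config and the manifest is what lets a pipeline resume safely after interruption: stages rerun only when their actual inputs differ, which supports the reproducibility goal the abstract emphasizes.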