CNSL-bench: Benchmarking the Sign Language Understanding Capabilities of MLLMs on Chinese National Sign Language

arXiv cs.AI / 4/27/2026


Key Points

  • The paper introduces CNSL-bench, the first comprehensive benchmark for evaluating multimodal large language models’ (MLLMs) understanding of Chinese National Sign Language (CNSL).
  • CNSL-bench is grounded in the officially standardized National Common Sign Language Dictionary to reduce ambiguity from regional or non-canonical sign variants.
  • It covers multiple modalities (aligned text descriptions, illustrative images, and sign language videos) and spans manual articulatory forms such as air-writing, finger-spelling, and the Chinese manual alphabet; a hypothetical sketch of one such entry follows this list.
  • The authors benchmark 21 recent open-source and proprietary MLLMs and find that they still fall far short of human performance, with systematic gaps that vary by input modality and manual articulatory form.
  • Diagnostic analyses indicate that key performance limitations remain even as reasoning improves, and instruction-following robustness differs significantly across models.
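
The paper does not publish a data schema, but a minimal sketch of what one multimodal benchmark entry might look like helps make this structure concrete. All names below (ArticulatoryForm, CNSLBenchItem, and every field) are illustrative assumptions, not the paper's actual format:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ArticulatoryForm(Enum):
    """Manual articulatory forms named in the paper (string values are illustrative)."""
    AIR_WRITING = "air_writing"
    FINGER_SPELLING = "finger_spelling"
    CHINESE_MANUAL_ALPHABET = "chinese_manual_alphabet"


@dataclass
class CNSLBenchItem:
    """Hypothetical record for one benchmark entry; field names are assumptions,
    not the paper's published schema."""
    gloss: str                   # sign as standardized in the National Common
                                 #   Sign Language Dictionary
    text_description: str        # aligned textual description of the sign
    image_path: Optional[str]    # illustrative image, when available
    video_path: Optional[str]    # sign language video clip, when available
    form: ArticulatoryForm       # manual articulatory form of this entry
```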

Abstract

Sign language research has made significant progress thanks to advances in large language models (LLMs). However, the intrinsic ability of LLMs to understand sign language, especially in multimodal contexts, remains underexplored. To address this gap, we introduce CNSL-bench, the first comprehensive Chinese National Sign Language benchmark designed for evaluating multimodal large language models (MLLMs) on sign language understanding. The proposed CNSL-bench is characterized by: 1) Authoritative grounding, as it is anchored to the officially standardized National Common Sign Language Dictionary, mitigating ambiguity from regional or non-canonical variants and ensuring consistent semantic definitions; 2) Multimodal coverage, providing aligned textual descriptions, illustrative images, and sign language videos; and 3) Articulatory diversity, supporting fine-grained analysis across key manual articulatory forms, including air-writing, finger-spelling, and the Chinese manual alphabet. Using CNSL-bench, we extensively evaluate 21 up-to-date open-source and proprietary MLLMs. Our results reveal that, despite recent advances in multimodal modeling, current MLLMs remain substantially inferior to human performance, exhibiting systematic disparities across input modalities and manual articulatory forms. Additional diagnostic analyses suggest that several performance limitations persist beyond improvements in reasoning, and that instruction-following robustness varies substantially across models.
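
For intuition about how such a per-modality evaluation might be run, here is a minimal sketch. The model.answer(...) interface and exact-match scoring are assumptions; the paper's actual prompts, decoding settings, and metrics may differ:

```python
from collections import defaultdict

MODALITIES = ("text", "image", "video")


def evaluate_per_modality(model, items):
    """Score one MLLM separately on each input modality.

    `model.answer(...)` is a hypothetical interface, and exact-match scoring
    is an assumption; the paper's protocol may use different metrics.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        for modality in MODALITIES:
            payload = item.get(modality)   # aligned text / image / video input
            if payload is None:
                continue                   # entry lacks this modality
            prediction = model.answer(question=item["question"], media=payload)
            total[modality] += 1
            correct[modality] += int(prediction.strip() == item["answer"])
    # per-modality accuracy surfaces the modality-dependent gaps reported above
    return {m: correct[m] / total[m] for m in MODALITIES if total[m]}
```

Comparing the returned per-modality accuracies is one simple way to expose the systematic disparities across input modalities that the abstract describes.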