AI Navigate

STARK: Spatio-Temporal Attention for Representation of Keypoints for Continuous Sign Language Recognition

arXiv cs.CL / 3/18/2026

💬 Opinion · Models & Research

Key Points

  • The paper proposes a unified spatio-temporal attention network for continuous sign language recognition that attends both across keypoints (spatial) and within local temporal windows (temporal) to build context-aware representations.
  • Its encoder uses roughly 70-80% fewer parameters than current state-of-the-art models while delivering comparable accuracy on the Phoenix-14T dataset.
  • The approach integrates spatial-keypoint interactions and local temporal dynamics into a single architecture, aiming to reduce model size without sacrificing performance.
  • Evaluations on Phoenix-14T demonstrate competitive performance, indicating potential practical benefits for CSLR systems in terms of efficiency and deployment.

Abstract

Continuous Sign Language Recognition (CSLR) is a crucial task for understanding the languages of deaf communities. Contemporary keypoint-based approaches typically rely on spatio-temporal encoding, where spatial interactions among keypoints are modeled using Graph Convolutional Networks or attention mechanisms, while temporal dynamics are captured using 1D convolutional networks. However, such designs often introduce a large number of parameters in both the encoder and the decoder. This paper introduces a unified spatio-temporal attention network that computes attention scores both spatially (across keypoints) and temporally (within local windows), and aggregates features to produce a local context-aware spatio-temporal representation. The proposed encoder contains approximately 70-80% fewer parameters than existing state-of-the-art models while achieving comparable performance to keypoint-based methods on the Phoenix-14T dataset.
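The paper's exact formulation is not reproduced here, but the core idea (each keypoint attends jointly to all keypoints within a local temporal window, then aggregates their features) can be illustrated with a minimal NumPy sketch. This assumes single-head scaled dot-product attention and hypothetical projection matrices `Wq`, `Wk`, `Wv`; the actual STARK architecture may differ in heads, normalization, and windowing details.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def spatio_temporal_attention(X, Wq, Wk, Wv, window=2):
    """Unified attention over keypoints and a local temporal window (sketch).

    X: (T, K, D) features for T frames and K keypoints.
    Each query position (t, k) attends to every keypoint in
    frames [t - window, t + window], so spatial and temporal
    context are mixed in a single attention step.
    """
    T, K, D = X.shape
    Q, Kmat, V = X @ Wq, X @ Wk, X @ Wv
    out = np.zeros_like(V)
    for t in range(T):
        lo, hi = max(0, t - window), min(T, t + window + 1)
        keys = Kmat[lo:hi].reshape(-1, D)          # ((hi-lo)*K, D)
        vals = V[lo:hi].reshape(-1, D)
        scores = (Q[t] @ keys.T) / np.sqrt(D)      # (K, (hi-lo)*K)
        out[t] = softmax(scores, axis=-1) @ vals   # local context-aware features
    return out
```

Because attention is restricted to a local window, each frame's output depends only on nearby frames, which is what keeps the parameter and compute budget small relative to full spatio-temporal attention.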