Dynamic LIBRAS Gesture Recognition via CNN over Spatiotemporal Matrix Representation

arXiv cs.AI / March 30, 2026


Key Points

  • The paper introduces a dynamic LIBRAS (Brazilian Sign Language) gesture recognition approach that combines MediaPipe Hand Landmarker for extracting 21 hand skeletal keypoints with a CNN for classification.
  • Gestures are encoded as a spatiotemporal matrix of size 90×21 derived from keypoints, allowing the CNN to recognize 11 static and dynamic gesture classes.
  • For real-time continuous recognition, the method uses a sliding window and temporal frame triplication to avoid recurrent networks while still capturing temporal context.
  • Experiments report 95% accuracy in low-light conditions and 92% accuracy in normal lighting, supporting the feasibility of the approach for home automation device control.
  • The authors note that further systematic testing with a wider range of users is needed to better assess generalization performance across diverse populations.
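The 90×21 encoding in the second bullet can be sketched as follows. This is a minimal illustration, not the authors' implementation: the paper states only the matrix shape (90 time steps × 21 keypoints), so the choice of one scalar feature per keypoint per frame, zero-padding for short sequences, and truncation for long ones are all assumptions here.

```python
import numpy as np

NUM_FRAMES = 90     # temporal dimension of the CNN input (from the paper)
NUM_KEYPOINTS = 21  # MediaPipe Hand Landmarker returns 21 hand landmarks

def build_spatiotemporal_matrix(frames):
    """Stack per-frame keypoint features into a (90, 21) matrix.

    `frames` is a sequence of length-21 feature vectors, one scalar per
    keypoint per frame (e.g. a normalized coordinate -- the exact feature
    is an assumption; the paper only specifies the 90x21 shape).
    Sequences shorter than 90 frames are zero-padded; longer ones are
    truncated.
    """
    matrix = np.zeros((NUM_FRAMES, NUM_KEYPOINTS), dtype=np.float32)
    for i, feats in enumerate(frames[:NUM_FRAMES]):
        matrix[i] = feats
    return matrix

# Example: a 40-frame gesture is padded out to the fixed 90-frame input.
sequence = [np.random.rand(NUM_KEYPOINTS) for _ in range(40)]
m = build_spatiotemporal_matrix(sequence)
print(m.shape)  # (90, 21)
```

Treating the keypoint sequence as a fixed-size 2-D "image" is what lets a plain CNN, rather than a recurrent network, learn both spatial (keypoint) and temporal (frame) patterns.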

Abstract

This paper proposes a method for dynamic hand gesture recognition based on the composition of two models: the MediaPipe Hand Landmarker, responsible for extracting 21 skeletal keypoints of the hand, and a convolutional neural network (CNN) trained to classify gestures from a 90×21 spatiotemporal matrix representation of those keypoints. The method is applied to the recognition of LIBRAS (Brazilian Sign Language) gestures for device control in a home automation system, covering 11 classes of static and dynamic gestures. For real-time inference, a sliding window with temporal frame triplication is used, enabling continuous recognition without recurrent networks. Tests achieved 95% accuracy under low-light conditions and 92% under normal lighting. The results indicate that the approach is effective, although systematic experiments with greater user diversity are needed for a more thorough evaluation of generalization.
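The sliding-window inference with frame triplication could look like the sketch below. The abstract does not spell the mechanism out, so this is one plausible reading under stated assumptions: a 30-frame capture window whose frames are each repeated three times along the time axis to fill the 90-row CNN input; the window length, stride, and repeat factor are all hypothetical.

```python
import numpy as np

WINDOW = 30  # assumed capture-window length (90 / 3)
REPEAT = 3   # "temporal frame triplication": each frame repeated 3x

def triplicate_window(window):
    """Expand a (30, 21) keypoint window to the (90, 21) CNN input
    by repeating each frame three times along the time axis."""
    window = np.asarray(window, dtype=np.float32)
    assert window.shape[0] == WINDOW
    return np.repeat(window, REPEAT, axis=0)

def sliding_windows(stream, stride=1):
    """Yield triplicated (90, 21) inputs over a stream of per-frame
    keypoint rows, advancing by `stride` frames per inference step."""
    for start in range(0, len(stream) - WINDOW + 1, stride):
        yield triplicate_window(stream[start:start + WINDOW])

# Example: 45 frames of landmarks, one CNN input every 5 frames.
stream = [np.random.rand(21) for _ in range(45)]
inputs = list(sliding_windows(stream, stride=5))
print(inputs[0].shape)  # (90, 21)
```

Repeating frames this way lets a single CNN trained on 90-frame matrices run continuously on shorter live windows, which is how the method avoids recurrent networks while still covering temporal context.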