Explicit Time-Frequency Dynamics for Skeleton-Based Gait Recognition

arXiv cs.CV / 4/6/2026


Key Points

  • The paper proposes a plug-and-play Wavelet Feature Stream that adds explicit time-frequency dynamics of joint velocities to existing skeleton-based gait recognition models.
  • It converts per-joint velocity sequences into multi-scale scalograms via the continuous wavelet transform (CWT), then uses a lightweight multi-scale CNN to learn discriminative dynamic cues.
  • The learned dynamic descriptor is fused with the original skeleton backbone representation for classification, without changing the backbone architecture or requiring extra supervision.
  • Experiments on CASIA-B show consistent performance gains across strong skeleton backbones (GaitMixer, GaitFormer, GaitGraph), and the approach sets a new skeleton-based state of the art when combined with GaitMixer.
  • The method delivers especially large improvements under covariate shifts such as carrying bags (BG) and wearing coats (CL), indicating that explicit time-frequency modeling complements spatio-temporal encoders.
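The velocity-to-scalogram step described above can be sketched in pure Python. This is a minimal illustration, not the paper's implementation: the real-Morlet wavelet, the `w0` and `support` parameters, and the function names are all illustrative assumptions; the paper's CWT choice and discretization may differ.

```python
import math

def joint_velocities(positions):
    """Per-joint velocity as first-order differences of a 1-D coordinate sequence."""
    return [b - a for a, b in zip(positions, positions[1:])]

def morlet(t, scale, w0=5.0):
    """Real Morlet wavelet sample at lag t for a given scale (illustrative choice)."""
    x = t / scale
    return math.exp(-0.5 * x * x) * math.cos(w0 * x) / math.sqrt(scale)

def cwt_scalogram(signal, scales, support=4.0):
    """|CWT| magnitudes: one row per scale, one column per time step."""
    n = len(signal)
    rows = []
    for s in scales:
        half = int(support * s)  # truncate the wavelet to +/- support*scale lags
        row = []
        for t in range(n):
            acc = 0.0
            for k in range(-half, half + 1):
                if 0 <= t + k < n:
                    acc += signal[t + k] * morlet(k, s)
            row.append(abs(acc))
        rows.append(row)
    return rows
```

For a periodic velocity signal, the scalogram's energy concentrates at the scale matching the gait cycle's period, which is the kind of explicit time-frequency cue the Wavelet Feature Stream feeds to its CNN.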

Abstract

Skeleton-based gait recognizers excel at modeling spatial configurations but often underuse explicit motion dynamics that are crucial under appearance changes. We introduce a plug-and-play Wavelet Feature Stream that augments any skeleton backbone with time-frequency dynamics of joint velocities. Concretely, per-joint velocity sequences are transformed by the continuous wavelet transform (CWT) into multi-scale scalograms, from which a lightweight multi-scale CNN learns discriminative dynamic cues. The resulting descriptor is fused with the backbone representation for classification, requiring no changes to the backbone architecture or additional supervision. On CASIA-B, the proposed stream delivers consistent gains on strong skeleton backbones (e.g., GaitMixer, GaitFormer, GaitGraph) and establishes a new skeleton-based state of the art when attached to GaitMixer. The improvements are especially pronounced under covariate shifts such as carrying bags (BG) and wearing coats (CL), highlighting the complementarity of explicit time-frequency modeling and standard spatio-temporal encoders.
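The fusion step in the abstract can be sketched as follows. This is a toy stand-in under stated assumptions: global-average pooling substitutes for the paper's multi-scale CNN, and concatenation is one plausible late-fusion choice; both function names are hypothetical.

```python
def scalogram_descriptor(scalogram):
    """Pool each scale row of a scalogram (list of rows of magnitudes) to one
    value. A toy stand-in for the multi-scale CNN's learned descriptor."""
    return [sum(row) / len(row) for row in scalogram]

def fuse(backbone_embedding, wavelet_descriptor):
    """Late fusion by concatenation: the backbone is untouched, and the fused
    vector is what the classifier would consume."""
    return list(backbone_embedding) + list(wavelet_descriptor)
```

Because fusion happens only at the representation level, the stream stays plug-and-play: any backbone that emits a fixed-length embedding can be augmented without retraining from a different architecture.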