LLM Enhanced Action Recognition via Hierarchical Global-Local Skeleton-Language Model

arXiv cs.CV / 3/31/2026


Key Points

  • The paper proposes HocSLM, a hierarchical global-local skeleton-language model aimed at improving skeleton-based human action recognition by capturing long-range dependencies and richer temporal dynamics.
  • It introduces HGLNet with composite-topology spatial modeling and a dual-path hierarchical temporal module to collaboratively learn both global and local spatio-temporal relationships while respecting human-body structure.
  • To inject action semantics, the method uses a large vision-language model (VLM) to generate textual descriptions from the original RGB video, then trains a skeleton-language sequential fusion module to align skeleton features with the text in a shared semantic space.
  • Experiments on NTU RGB+D 60, NTU RGB+D 120, and Northwestern-UCLA show state-of-the-art performance, indicating improved semantic discrimination and cross-modal understanding.
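The cross-modal alignment described above, mapping skeleton features and VLM-generated captions into a shared semantic space, can be illustrated with a minimal CLIP-style sketch. This is not the paper's implementation; the function names (`align_scores`), projection matrices (`W_s`, `W_t`), and dimensions are all hypothetical, and the random features stand in for real backbone and sentence embeddings.

```python
import numpy as np

def l2norm(x, axis=-1):
    # Normalize vectors so the dot product becomes cosine similarity.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-8)

def align_scores(skel_feats, text_feats, W_s, W_t):
    """Project both modalities into a shared space and score all pairs
    by cosine similarity (a generic alignment sketch, not HocSLM itself)."""
    zs = l2norm(skel_feats @ W_s)   # (N, d) skeleton embeddings
    zt = l2norm(text_feats @ W_t)   # (N, d) text embeddings
    return zs @ zt.T                # (N, N) pairwise similarity matrix

rng = np.random.default_rng(0)
skel = rng.normal(size=(4, 256))    # toy skeleton features from a backbone
text = rng.normal(size=(4, 512))    # toy embeddings of VLM captions
W_s = rng.normal(size=(256, 128))   # hypothetical learned projections
W_t = rng.normal(size=(512, 128))
S = align_scores(skel, text, W_s, W_t)
print(S.shape)  # (4, 4); diagonal entries score matched skeleton-caption pairs
```

In a contrastive training setup, the diagonal of `S` (matched pairs) would be pushed up and off-diagonal entries pushed down, which is one common way such a shared semantic space is learned.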

Abstract

Skeleton-based human action recognition has achieved remarkable progress in recent years. However, most existing GCN-based methods rely on short-range motion topologies, which not only struggle to capture long-range joint dependencies and complex temporal dynamics but also limit cross-modal semantic alignment and understanding due to insufficient modeling of action semantics. To address these challenges, we propose a hierarchical global-local skeleton-language model (HocSLM), which enables the large action model to better represent action semantics. First, we design a hierarchical global-local network (HGLNet) consisting of a composite-topology spatial module and a dual-path hierarchical temporal module. By synergistically integrating multi-level global and local modules, HGLNet achieves dynamically collaborative modeling at both global and local scales while preserving prior knowledge of the human physical structure, significantly enhancing the model's representation of complex spatio-temporal relationships. Then, a large vision-language model (VLM) generates textual descriptions from the original RGB video sequences, providing rich action semantics for further training the skeleton-language model. Furthermore, we introduce a skeleton-language sequential fusion module that combines the features from HGLNet with the generated descriptions, using a skeleton-language model (SLM) to precisely align skeletal spatio-temporal features and textual action descriptions within a unified semantic space. The SLM significantly enhances HGLNet's semantic discrimination and cross-modal understanding. Extensive experiments demonstrate that the proposed HocSLM achieves state-of-the-art performance on three mainstream benchmark datasets: NTU RGB+D 60, NTU RGB+D 120, and Northwestern-UCLA.
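The dual-path temporal idea in the abstract, a local path for short-range motion and a global path for long-range frame dependencies, can be sketched generically as follows. This is a toy numpy illustration under assumed shapes (16 frames, 64-dim per-frame features), not HGLNet's actual modules: the moving average stands in for a small-kernel temporal convolution, the unparameterized self-attention stands in for the global branch, and the mixing weight `alpha` is illustrative.

```python
import numpy as np

def local_path(x, k=3):
    """Local temporal path: moving average over a short window, a stand-in
    for a small-kernel temporal convolution capturing short-range motion."""
    T, _ = x.shape
    pad = np.pad(x, ((k // 2, k // 2), (0, 0)), mode="edge")
    return np.stack([pad[t:t + k].mean(axis=0) for t in range(T)])

def global_path(x):
    """Global temporal path: self-attention over all frames, so each frame
    can aggregate information from every other (long-range dependencies)."""
    scores = x @ x.T / np.sqrt(x.shape[1])          # (T, T) frame affinities
    scores -= scores.max(axis=1, keepdims=True)     # numerically stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)
    return attn @ x

def dual_path(x, alpha=0.5):
    # Fuse both paths; alpha is a hypothetical mixing weight.
    return alpha * local_path(x) + (1 - alpha) * global_path(x)

x = np.random.default_rng(1).normal(size=(16, 64))  # 16 frames, 64-dim features
y = dual_path(x)
print(y.shape)  # (16, 64): per-frame features enriched by both scales
```

The design point is that neither path alone suffices: the local path misses dependencies between distant frames, while the global path blurs fine-grained motion; combining them is the standard motivation for such dual-path temporal modules.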