LLM Enhanced Action Recognition via Hierarchical Global-Local Skeleton-Language Model
arXiv cs.CV / 3/31/2026
Key Points
- The paper proposes HocSLM, a hierarchical global-local skeleton-language model aimed at improving skeleton-based human action recognition by capturing long-range dependencies and richer temporal dynamics.
- It introduces HGLNet, which pairs composite-topology spatial modeling with a dual-path hierarchical temporal module so that global and local spatio-temporal relationships are learned jointly while respecting human-body structure (a hedged sketch of such a dual-path temporal block follows this list).
- To inject action semantics, the method uses a large vision-language model (VLM) to generate textual descriptions from the original RGB video, then trains a skeleton-language sequential fusion module that aligns skeleton features with the text in a shared semantic space (see the alignment sketch after this list).
- Experiments on NTU RGB+D 60, NTU RGB+D 120, and Northwestern-UCLA show state-of-the-art performance, indicating improved semantic discrimination and cross-modal understanding.
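The summary does not spell out the layer design, but the dual-path idea in the second point can be sketched as two parallel temporal convolutions: a small-kernel path for local dynamics and a dilated path for long-range dependencies. The PyTorch block below is a minimal sketch under that assumption; `DualPathTemporalBlock`, its kernel sizes, and the concatenation fusion are illustrative choices, not the authors' exact module.

```python
import torch
import torch.nn as nn

class DualPathTemporalBlock(nn.Module):
    """Hypothetical dual-path temporal module: a local path with a small
    temporal kernel and a global path with a dilated kernel, fused by
    concatenation. Tensor layout: (batch, channels, frames, joints)."""

    def __init__(self, channels: int, local_kernel: int = 3,
                 global_kernel: int = 5, dilation: int = 4):
        super().__init__()
        half = channels // 2
        # Local path: short-range temporal dynamics.
        self.local = nn.Conv2d(
            channels, half, kernel_size=(local_kernel, 1),
            padding=(local_kernel // 2, 0))
        # Global path: dilation widens the temporal receptive field
        # to capture long-range dependencies across frames.
        self.glob = nn.Conv2d(
            channels, half, kernel_size=(global_kernel, 1),
            padding=(dilation * (global_kernel - 1) // 2, 0),
            dilation=(dilation, 1))
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Concatenate the two paths along channels, then add a residual.
        out = torch.cat([self.local(x), self.glob(x)], dim=1)
        return self.act(self.bn(out) + x)


# Example: batch of 16 clips, 64 channels, 64 frames, 25 joints (NTU skeleton).
x = torch.randn(16, 64, 64, 25)
print(DualPathTemporalBlock(64)(x).shape)  # torch.Size([16, 64, 64, 25])
```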
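For the skeleton-language fusion in the third point, a common way to align two modalities in a shared semantic space is a symmetric, CLIP-style contrastive loss between pooled skeleton features and text embeddings. The sketch below assumes that formulation; `skeleton_text_alignment_loss` and the temperature value are hypothetical, and the paper's sequential fusion module may differ.

```python
import torch
import torch.nn.functional as F

def skeleton_text_alignment_loss(skel_feats: torch.Tensor,
                                 text_feats: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """Hypothetical CLIP-style contrastive loss aligning pooled skeleton
    features with VLM-generated text embeddings. Both inputs are assumed
    already projected to the same shared dimension: (batch, dim)."""
    skel = F.normalize(skel_feats, dim=-1)
    text = F.normalize(text_feats, dim=-1)
    logits = skel @ text.t() / temperature  # (batch, batch) similarity matrix
    targets = torch.arange(len(logits), device=logits.device)
    # Symmetric cross-entropy: each skeleton clip should match its own
    # text description, and vice versa.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2


# Example: 8 skeleton clips paired with 8 descriptions in a 256-d space.
loss = skeleton_text_alignment_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```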