SBF: An Effective Representation to Augment Skeleton for Video-based Human Action Recognition

arXiv cs.CV / 4/7/2026

Key Points

The paper addresses limitations of using 2D skeletons for video-based human action recognition in scenes where depth, body contours, and human-object interactions are important.
It proposes Scale-Body-Flow (SBF), an augmented representation that combines per-joint scale/depth cues, a human body outline map, and an optical-flow-derived interaction map.
To generate SBF, the authors introduce SFSNet, a segmentation network trained using supervision from existing skeleton and optical flow signals without requiring additional annotations.
Experiments across multiple datasets show that the SBF+SFSNet pipeline improves action recognition accuracy while maintaining similar compactness and efficiency versus skeleton-only state-of-the-art approaches.

Abstract

Many modern video-based human action recognition (HAR) approaches use the 2D skeleton as an intermediate representation in their prediction pipelines. Despite encouraging overall results, these approaches still struggle in many common scenes, mainly because the skeleton does not capture critical action-related information about the depth of the joints, the contour of the human body, and the interaction between the human and objects. To address this, we propose an effective approach that augments the skeleton in the HAR pipeline with a representation capturing this action-related information. The representation, termed Scale-Body-Flow (SBF), consists of three distinct components: a scale map volume given by the scale (and hence depth information) of each joint, a body map outlining the human subject, and a flow map indicating human-object interaction given by pixel-wise optical flow values. To predict SBF, we further present SFSNet, a novel segmentation network supervised by the skeleton and optical flow, with no annotation overhead beyond the existing skeleton extraction. Extensive experiments across different datasets demonstrate that our pipeline based on SBF and SFSNet achieves significantly higher HAR accuracy than state-of-the-art skeleton-only approaches, with similar compactness and efficiency.
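To make the three components concrete, here is a minimal NumPy sketch of how a scale map volume, a body map, and a flow map could be stacked into one per-frame tensor. This is an illustration under stated assumptions, not the authors' implementation: in the paper SBF is predicted by SFSNet, whereas this sketch assembles it directly from inputs that are assumed to be given. The function names, the Gaussian blob encoding of each joint, the `sigma` and `threshold` parameters, and the channel layout are all hypothetical.

```python
import numpy as np

def scale_map_volume(joints_xy, joint_scales, height, width, sigma=2.0):
    """One channel per joint: a Gaussian blob at the joint, weighted by its scale.

    The blob encoding and sigma are assumptions made for illustration.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    volume = np.zeros((len(joints_xy), height, width), dtype=np.float32)
    for k, ((x, y), s) in enumerate(zip(joints_xy, joint_scales)):
        blob = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
        volume[k] = s * blob
    return volume

def flow_map(flow, threshold=1.0):
    """Per-pixel optical-flow magnitude, zeroed below a hypothetical threshold."""
    magnitude = np.linalg.norm(flow, axis=-1)
    return np.where(magnitude >= threshold, magnitude, 0.0).astype(np.float32)

def assemble_sbf(joints_xy, joint_scales, body_mask, flow):
    """Stack scale channels, one body channel, and one flow channel (assumed layout)."""
    height, width = body_mask.shape
    scale = scale_map_volume(joints_xy, joint_scales, height, width)
    body = body_mask.astype(np.float32)[None]
    motion = flow_map(flow)[None]
    return np.concatenate([scale, body, motion], axis=0)

# Toy inputs: two joints, a rectangular body mask, and a small moving patch.
joints = [(8, 8), (20, 12)]           # (x, y) pixel coordinates
scales = [0.5, 1.5]                   # per-joint scale values (stand-in for depth cues)
mask = np.zeros((32, 32), dtype=bool)
mask[4:28, 6:24] = True
flow = np.zeros((32, 32, 2), dtype=np.float32)
flow[10:14, 18:22] = (3.0, 4.0)       # flow magnitude 5 inside the patch

sbf = assemble_sbf(joints, scales, mask, flow)
print(sbf.shape)  # (4, 32, 32): 2 joint channels + body + flow
```

Keeping the output as a small stack of maps mirrors the abstract's point that the representation stays compact, though the actual resolution and channel layout used in the paper are not stated in this summary.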
