B-MoE: A Body-Part-Aware Mixture-of-Experts "All Parts Matter" Approach to Micro-Action Recognition

arXiv cs.CV / 3/26/2026


Key Points

  • The paper introduces B-MoE, a body-part-aware Mixture-of-Experts framework aimed at improving micro-action recognition of subtle, short, and highly ambiguous motions such as glances and nods.
  • B-MoE assigns different experts to distinct body regions (head, body, upper limbs, lower limbs) and uses a cross-attention routing mechanism to learn inter-region relationships and dynamically select informative regions for each action.
  • It leverages a lightweight Macro-Micro Motion Encoder (M3E) with a dual-stream design that fuses region-specific semantic cues with global motion features to capture both long-range context and fine-grained local motion.
  • Experiments on MA-52, SocialGesture, and MPII-GroupInteraction report consistent state-of-the-art gains, particularly for ambiguous and low-amplitude classes.
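To make the routing idea above concrete, here is a minimal NumPy sketch of body-part-aware expert routing: one lightweight "expert" per region, with the global motion feature acting as a cross-attention query over region features to weight each expert's output. All shapes, names, and the softmax routing form are illustrative assumptions; the paper's actual M3E encoders and cross-attention details may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # feature dimension (assumed for illustration)
REGIONS = ["head", "body", "upper_limbs", "lower_limbs"]

# One lightweight "expert" per body region (a stand-in for an M3E encoder):
# here simply a random linear map, for illustration only.
experts = {r: rng.standard_normal((D, D)) / np.sqrt(D) for r in REGIONS}

def route_and_fuse(region_feats: dict, global_feat: np.ndarray) -> np.ndarray:
    """Cross-attention-style routing: the global motion feature queries the
    region features, and softmax scores weight each expert's output."""
    keys = np.stack([region_feats[r] for r in REGIONS])      # (4, D)
    scores = keys @ global_feat / np.sqrt(D)                 # (4,) attention logits
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                                 # softmax over regions
    outputs = np.stack([experts[r] @ region_feats[r] for r in REGIONS])  # (4, D)
    fused = weights @ outputs                                # routed regional cue, (D,)
    # Dual-stream fusion (assumed): combine the routed regional cue
    # with the global motion feature.
    return fused + global_feat

region_feats = {r: rng.standard_normal(D) for r in REGIONS}
global_feat = rng.standard_normal(D)
out = route_and_fuse(region_feats, global_feat)
print(out.shape)  # (8,)
```

In this sketch the routing is soft (all four regions contribute, weighted by attention); a top-k selection over the softmax weights would give the sparser "select the most informative regions" behavior the paper describes.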

Abstract

Micro-actions, such as glances, nods, or minor posture shifts, are fleeting, low-amplitude motions that carry rich social meaning but remain difficult for current action recognition models due to their subtlety, short duration, and high inter-class ambiguity. In this paper, we introduce B-MoE, a Body-part-aware Mixture-of-Experts framework designed to explicitly model the structured nature of human motion. In B-MoE, each expert specializes in a distinct body region (head, body, upper limbs, lower limbs) and builds on the lightweight Macro-Micro Motion Encoder (M3E), which captures long-range contextual structure and fine-grained local motion. A cross-attention routing mechanism learns inter-region relationships and dynamically selects the most informative regions for each micro-action. B-MoE uses a dual-stream encoder that fuses these region-specific semantic cues with global motion features, jointly capturing the spatially localized cues and temporally subtle variations that characterize micro-actions. Experiments on three challenging benchmarks (MA-52, SocialGesture, and MPII-GroupInteraction) show consistent state-of-the-art gains, with improvements on ambiguous, underrepresented, and low-amplitude classes.