MSGL-Transformer: A Multi-Scale Global-Local Transformer for Rodent Social Behavior Recognition

arXiv cs.CV / 4/10/2026


Key Points

  • The paper introduces MSGL-Transformer, a lightweight multi-scale global-local transformer designed to recognize rodent social behaviors from pose-based temporal sequences while reducing reliance on manual scoring.
  • It uses parallel attention branches covering short-, medium-, and global temporal ranges, plus a Behavior-Aware Modulation (BAM) block to emphasize behavior-relevant temporal embeddings before attention.
  • Experiments on RatSI and CalMS21 show strong performance: 75.4% mean accuracy (F1 = 0.745) on RatSI and 87.1% accuracy (F1 = 0.8745) on CalMS21.
  • Results indicate MSGL-Transformer outperforms several baselines (e.g., TCN, LSTM variants, and multiple pose/action recognition architectures) and transfers across datasets with only input dimensionality and class-count adjustments.
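The parallel short-, medium-, and global-range attention idea can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the window sizes, the single-head dot-product attention, and the averaging fusion are all illustrative assumptions; the paper's model uses a full transformer encoder.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def banded_attention(x, window):
    # x: (T, D) pose-embedding sequence; window=None means global attention.
    T, D = x.shape
    scores = x @ x.T / np.sqrt(D)                     # (T, T) frame similarities
    if window is not None:
        idx = np.arange(T)
        far = np.abs(idx[:, None] - idx[None, :]) > window
        scores = np.where(far, -1e9, scores)          # attend only to nearby frames
    return softmax(scores, axis=-1) @ x               # (T, D)

def multi_scale_attention(x, short=2, medium=8):
    # Three parallel branches at different temporal ranges, fused by averaging
    # (the fusion rule here is an assumption for illustration).
    branches = [banded_attention(x, w) for w in (short, medium, None)]
    return np.mean(branches, axis=0)

x = np.random.default_rng(0).standard_normal((30, 12))  # 30 frames, 12D pose (RatSI-like)
y = multi_scale_attention(x)
print(y.shape)  # (30, 12)
```

Each branch sees the same sequence but a different temporal neighborhood, so brief events (e.g. a nose contact) and extended ones (e.g. following) are captured by different branches before fusion.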

Abstract

Recognition of rodent behavior is important for understanding neural and behavioral mechanisms. Traditional manual scoring is time-consuming and prone to human error. We propose MSGL-Transformer, a Multi-Scale Global-Local Transformer for recognizing rodent social behaviors from pose-based temporal sequences. The model employs a lightweight transformer encoder with multi-scale attention to capture motion dynamics across different temporal scales. The architecture integrates parallel short-range, medium-range, and global attention branches to explicitly capture behavior dynamics at multiple temporal scales. We also introduce a Behavior-Aware Modulation (BAM) block, inspired by SE-Networks, which modulates temporal embeddings to emphasize behavior-relevant features prior to attention. We evaluate on two datasets: RatSI (5 behavior classes, 12D pose inputs) and CalMS21 (4 behavior classes, 28D pose inputs). On RatSI, MSGL-Transformer achieves 75.4% mean accuracy and F1-score of 0.745 across nine cross-validation splits, outperforming TCN, LSTM, and Bi-LSTM. On CalMS21, it achieves 87.1% accuracy and F1-score of 0.8745, a +10.7% improvement over HSTWFormer, and outperforms ST-GCN, MS-G3D, CTR-GCN, and STGAT. The same architecture generalizes across both datasets with only input dimensionality and number of classes adjusted.
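Since the BAM block is described as SE-Network-inspired, its squeeze-excite pattern can be sketched as follows. The bottleneck ratio, ReLU/sigmoid choices, and weight shapes below are assumptions based on the standard SE design, not details taken from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bam_modulate(x, w1, w2):
    # x: (T, D) temporal embeddings. Squeeze over time, excite per channel,
    # then rescale each channel before it enters the attention branches.
    s = x.mean(axis=0)                          # (D,) global temporal summary
    g = sigmoid(w2 @ np.maximum(w1 @ s, 0.0))   # (D,) channel gates in (0, 1)
    return x * g                                # emphasize behavior-relevant channels

rng = np.random.default_rng(1)
T, D, r = 30, 12, 4                             # r: assumed bottleneck reduction ratio
x = rng.standard_normal((T, D))
w1 = rng.standard_normal((D // r, D))           # squeeze D -> D/r
w2 = rng.standard_normal((D, D // r))           # excite D/r -> D
out = bam_modulate(x, w1, w2)
print(out.shape)  # (30, 12)
```

Because the gates lie in (0, 1), the block can only attenuate channels, letting the network suppress pose dimensions irrelevant to the current behavior before attention is applied.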