Boundary-Centric Active Learning for Temporal Action Segmentation

arXiv cs.CV · April 17, 2026


Key Points

  • The paper addresses temporal action segmentation (TAS), arguing that most labeling effort on untrimmed videos is spent on action transitions, where small timing errors disproportionately hurt segmental metrics.
  • It proposes B-ACT, a clip-budgeted active learning framework that focuses supervision on high-leverage boundary regions using predictive uncertainty and a boundary score combining neighborhood uncertainty, class ambiguity, and temporal dynamics.
  • B-ACT uses a hierarchical two-stage loop: first selecting unlabeled videos to query, then selecting the top-K candidate transition boundaries within each chosen video for labeling.
  • The annotation strategy requests labels only for boundary frames while still training on boundary-centered clips to leverage temporal context from the model’s receptive field.
  • Experiments on GTEA, 50Salads, and Breakfast show B-ACT achieves better label efficiency than prior TAS active learning baselines and state-of-the-art methods under sparse labeling budgets, with the largest gains where boundary placement drives F1 scores.
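The within-video selection step can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the three score terms (neighborhood entropy, top-1/top-2 margin, frame-to-frame probability change) and their weights are assumptions standing in for the boundary score described above.

```python
import numpy as np

def boundary_score(probs, t, window=8, w_unc=1.0, w_amb=1.0, w_dyn=1.0):
    """Score a candidate boundary frame t from per-frame class probabilities.

    probs: (T, C) array of softmax outputs from the current model.
    The three terms and their weights are illustrative assumptions.
    """
    T, _ = probs.shape
    lo, hi = max(0, t - window), min(T, t + window + 1)
    nbhd = probs[lo:hi]

    # 1) Neighborhood uncertainty: mean entropy around the candidate frame.
    ent = -(nbhd * np.log(nbhd + 1e-12)).sum(axis=1).mean()

    # 2) Class ambiguity: small top-1 / top-2 margin at frame t.
    top2 = np.sort(probs[t])[-2:]
    ambiguity = 1.0 - (top2[1] - top2[0])

    # 3) Temporal dynamics: how fast the predicted distribution changes near t.
    dyn = np.abs(np.diff(nbhd, axis=0)).sum(axis=1).mean()

    return w_unc * ent + w_amb * ambiguity + w_dyn * dyn

def top_k_boundaries(probs, k=5, window=8):
    """Detect candidate transitions (argmax label changes) and keep the top-K."""
    labels = probs.argmax(axis=1)
    cands = [t for t in range(1, len(labels)) if labels[t] != labels[t - 1]]
    cands.sort(key=lambda t: boundary_score(probs, t, window), reverse=True)
    return cands[:k]
```

In this sketch, candidate transitions come straight from argmax label changes in the current predictions, matching the two-stage loop's second stage; the first stage (ranking whole videos by predictive uncertainty) would aggregate a similar uncertainty measure over all frames of each unlabeled video.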

Abstract

Temporal action segmentation (TAS) demands dense temporal supervision, yet most of the annotation cost in untrimmed videos is spent identifying and refining action transitions, where segmentation errors concentrate and small temporal shifts disproportionately degrade segmental metrics. We introduce B-ACT, a clip-budgeted active learning framework that explicitly allocates supervision to these high-leverage boundary regions. B-ACT operates in a hierarchical two-stage loop: (i) it ranks and queries unlabeled videos using predictive uncertainty, and (ii) within each selected video, it detects candidate transitions from the current model predictions and selects the top-K boundaries via a novel boundary score that fuses neighborhood uncertainty, class ambiguity, and temporal predictive dynamics. Importantly, our annotation protocol requests labels for only the boundary frames while still training on boundary-centered clips to exploit temporal context through the model's receptive field. Extensive experiments on GTEA, 50Salads, and Breakfast demonstrate that boundary-centric supervision delivers strong label efficiency and consistently surpasses representative TAS active learning baselines and prior state of the art under sparse budgets, with the largest gains on datasets where boundary placement dominates edit and overlap-based F1 scores.
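The annotation protocol described above (label only boundary frames, train on boundary-centered clips) can be sketched as a simple clip-construction step. The `half_width` parameter is a hypothetical stand-in for a clip size matched to the model's receptive field; the paper does not specify this value here.

```python
def boundary_clips(num_frames, boundaries, half_width=32):
    """Build boundary-centered training clips as (start, end, boundary) tuples.

    Only the boundary frame itself receives a label; the surrounding clip
    supplies temporal context through the model's receptive field.
    half_width is an assumed hyperparameter, not taken from the paper.
    """
    clips = []
    for b in boundaries:
        start = max(0, b - half_width)          # clamp at video start
        end = min(num_frames, b + half_width + 1)  # clamp at video end
        clips.append((start, end, b))
    return clips
```

For example, `boundary_clips(100, [5, 50], half_width=10)` yields clips clipped at the video edges, so a boundary near frame 0 still produces a valid (if shorter) training window.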