Combining Boundary Supervision and Segment-Level Regularization for Fine-Grained Action Segmentation

arXiv cs.CV / 4/3/2026


Key Points

  • The paper introduces a lightweight, architecture-agnostic training framework for Temporal Action Segmentation (TAS) that targets fine-grained boundary localization without adding heavy model components.
  • It uses two auxiliary losses: (1) a boundary-regression loss, implemented via a single extra output channel, that targets temporal boundary accuracy, and (2) a CDF-based segment-level regularization loss that improves within-segment coherence.
  • The method can be plugged into existing TAS models (such as MS-TCN, C2F-TCN, and FACT) purely as a training-time loss, requiring minimal architectural changes.
  • Experiments on three benchmark datasets show consistent gains in segment-level metrics (higher F1 and Edit scores) across multiple base models, while frame-wise accuracy remains largely unaffected.
  • Overall, the work argues that improved segmentation quality can be achieved primarily through simple loss design rather than more complex architectures or inference-time refinements.

Abstract

Recent progress in Temporal Action Segmentation (TAS) has increasingly relied on complex architectures, which can hinder practical deployment. We present a lightweight dual-loss training framework that improves fine-grained segmentation quality with only one additional output channel and two auxiliary loss terms, requiring minimal architectural modification. Our approach combines a boundary-regression loss that promotes accurate temporal localization via a single-channel boundary prediction and a CDF-based segment-level regularization loss that encourages coherent within-segment structure by matching cumulative distributions over predicted and ground-truth segments. The framework is architecture-agnostic and can be integrated into existing TAS models (e.g., MS-TCN, C2F-TCN, FACT) as a training-time loss function. Across three benchmark datasets, the proposed method improves segment-level consistency and boundary quality, yielding higher F1 and Edit scores across three different models. Frame-wise accuracy remains largely unchanged, highlighting that precise segmentation can be achieved through simple loss design rather than heavier architectures or inference-time refinements.
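Since the abstract's central architectural claim is "only one additional output channel," the integration into an existing TAS backbone can be sketched as below. This is a hypothetical illustration: the channel layout (C class logits plus one appended boundary logit) and the use of softmax/sigmoid heads are assumptions, as the abstract specifies only that a single extra channel carries the boundary prediction.

```python
import numpy as np

rng = np.random.default_rng(0)

C, T = 4, 20  # C action classes, T frames (toy sizes for illustration)

# Stand-in for an existing TAS backbone's output, widened by one channel:
# rows 0..C-1 are frame-wise class logits, row C is the boundary logit.
out = rng.standard_normal((C + 1, T))
class_logits, boundary_logit = out[:C], out[C]

# Frame-wise class probabilities: softmax over the class axis, per frame.
shifted = class_logits - class_logits.max(axis=0, keepdims=True)
exp = np.exp(shifted)
probs = exp / exp.sum(axis=0, keepdims=True)          # shape (C, T)

# Per-frame boundary confidence from the extra channel (sigmoid head).
boundary = 1.0 / (1.0 + np.exp(-boundary_logit))      # shape (T,)
```

At training time, `probs` would feed the usual frame-wise classification loss plus the CDF-based segment regularizer, and `boundary` the boundary-regression loss; at inference the extra channel can simply be discarded, consistent with the claim that no inference-time refinement is needed.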